Word List (by frequency)
# | word (frequency) | phonetic | sentence |
1 | ConvNet (57) | |
- We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
- With ConvNets becoming more of a commodity in the computer vision field, a number of attempts have been made to improve the original architecture of Krizhevsky et al. (2012) in a bid to achieve better accuracy.
- In this paper, we address another important aspect of ConvNet architecture design – its depth.
- As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as part of relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning).
- In Sect. 2, we describe our ConvNet configurations.
- 2 CONVNET CONFIGURATIONS
- To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011); Krizhevsky et al. (2012).
- In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2).
- During training, the input to our ConvNets is a fixed-size 224 × 224 RGB image.
- The ConvNet configurations, evaluated in this paper, are outlined in Table 1, one per column.
- Table 1: ConvNet configurations (shown in columns).
- Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014).
- Goodfellow et al. (2014) applied deep ConvNets (11 weight layers) to the task of street number recognition, and showed that the increased depth led to better performance.
- GoogLeNet (Szegedy et al., 2014), a top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNets (22 weight layers) and small convolution filters (apart from 3 × 3, they also use 1 × 1 and 5 × 5 convolutions).
- In this section, we describe the details of classification ConvNet training and evaluation.
- The ConvNet training procedure generally follows Krizhevsky et al. (2012) (except for sampling the input crops from multi-scale training images, as explained later).
- To obtain the fixed-size 224 × 224 ConvNet input images, they were randomly cropped from rescaled training images (one crop per image per SGD iteration).
- Let S be the smallest side of an isotropically-rescaled training image, from which the ConvNet input is cropped (we also refer to S as the training scale).
- Given a ConvNet configuration, we first trained the network using S = 256.
- At test time, given a trained ConvNet and an input image, it is classified in the following way.
- Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured.
- While more sophisticated methods of speeding up ConvNet training have been recently proposed (Krizhevsky, 2014), which employ model and data parallelism for different layers of the net, we have found that our conceptually much simpler scheme already provides a speedup of 3.75 times on an off-the-shelf 4-GPU system, as compared to using a single GPU.
- In this section, we present the image classification results achieved by the described ConvNet architectures on the ILSVRC-2012 dataset (which was used for ILSVRC 2012–2014 challenges).
- We begin with evaluating the performance of individual ConvNet models at a single scale with the layer configurations described in Sect. 2.2.
- Table 3: ConvNet performance at a single test scale.
- Second, we observe that the classification error decreases with the increased ConvNet depth: from 11 layers in A to 19 layers in E.
- Having evaluated the ConvNet models at a single scale, we now assess the effect of scale jittering at test time.
- Table 4: ConvNet performance at multiple test scales.
- In Table 5 we compare dense ConvNet evaluation with multi-crop evaluation (see Sect. 3.2 for details).
- Table 5: ConvNet evaluation techniques comparison.
- 4.4 CONVNET FUSION
- Up until now, we evaluated the performance of individual ConvNet models.
- Table 6: Multiple ConvNet fusion results.
- As can be seen from Table 7, our very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions.
- Notably, we did not depart from the classical ConvNet architecture of LeCun et al. (1989), but improved it by substantially increasing the depth.
- It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) with substantially increased depth.
- In the main body of the paper we have considered the classification task of the ILSVRC challenge, and performed a thorough evaluation of ConvNet architectures of different depth.
- A.1 LOCALISATION CONVNET
- To perform object localisation, we use a very deep ConvNet, where the last fully connected layer predicts the bounding box location instead of the class scores.
- Apart from the last bounding box prediction layer, we use the ConvNet architecture D (Table 1), which contains 16 weight layers and was found to be the best-performing in the classification task (Sect. 4).
- Training of localisation ConvNets is similar to that of the classification ConvNets (Sect. 3.1).
- The second, fully-fledged, testing procedure is based on the dense application of the localisation ConvNet to the whole image, similarly to the classification task (Sect. 3.2).
- To come up with the final prediction, we utilise the greedy merging procedure of Sermanet et al. (2014), which first merges spatially close predictions (by averaging their coordinates), and then rates them based on the class scores, obtained from the classification ConvNet.
- When several localisation ConvNets are used, we first take the union of their sets of bounding box predictions, and then run the merging procedure on the union.
- All ConvNet layers (except for the last one) have the configuration D (Table 1), while the last layer performs either single-class regression (SCR) or per-class regression (PCR).
- As can be seen from Table 9, application of the localisation ConvNet to the whole image substantially improves the results compared to using a center crop (Table 8), despite using the top-5 predicted class labels instead of the ground truth.
- This indicates the performance advancement brought by our very deep ConvNets – we got better results with a simpler localisation method, but a more powerful representation.
- In the previous sections we have discussed training and evaluation of very deep ConvNets on the ILSVRC dataset.
- In this section, we evaluate our ConvNets, pre-trained on ILSVRC, as feature extractors on other, smaller, datasets, where training large models from scratch is not feasible due to over-fitting.
- To utilise the ConvNets, pre-trained on ILSVRC, for image classification on other datasets, we remove the last fully-connected layer (which performs 1000-way ILSVRC classification), and use 4096-D activations of the penultimate layer as image features, which are aggregated across multiple locations and scales (a minimal sketch of this pipeline follows this entry).
- For simplicity, pre-trained ConvNet weights are kept fixed (no fine-tuning is performed).
- Results marked with * were achieved using ConvNets pre-trained on the extended ILSVRC dataset (2000 classes).
- We considered two training settings: (i) computing the ConvNet features on the whole image and ignoring the provided bounding box; (ii) computing the features on the whole image and on the provided bounding box, and stacking them to obtain the final representation.
- For instance, Girshick et al. (2014) achieve the state-of-the-art object detection results by replacing the ConvNet of Krizhevsky et al. (2012) with our 16-layer model.
- Results marked with * were achieved using ConvNets pre-trained on the extended ILSVRC dataset (1512 classes).
|
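The "simple pipeline" referenced above (fixed deep features classified by a linear SVM without fine-tuning) can be made concrete with a short sketch. This is an illustrative reconstruction using torchvision's pre-trained VGG-16 rather than the authors' original implementation; `train_images` and `train_labels` are hypothetical placeholders for a target dataset.

```python
# Sketch: a pre-trained VGG-16 as a fixed feature extractor; the 4096-D
# penultimate activations feed a linear SVM, with no fine-tuning.
import torch
import torchvision.models as models
import torchvision.transforms as T

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
# Drop the final 1000-way ILSVRC classification layer, keeping the
# penultimate fully-connected layer whose 4096-D output is the feature.
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def features(images):
    """images: list of PIL images -> (N, 4096) array of deep features."""
    batch = torch.stack([preprocess(im) for im in images])
    return vgg(batch).numpy()

# Linear SVM on the fixed deep features (train_images/train_labels assumed):
# from sklearn.svm import LinearSVC
# clf = LinearSVC(C=1.0).fit(features(train_images), train_labels)
```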
2 | ILSVRC (44) | [!≈ aɪ el es vi: ɑ:(r) si:] |
- For instance, the best-performing submissions to the ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer.
- As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as part of relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning).
- The details of the image classification training and evaluation are then presented in Sect. 3, and the configurations are compared on the ILSVRC classification task in Sect. 4.
- For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B.
- A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class).
- We note that none of our networks (except for one) contain Local Response Normalisation (LRN) (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.
- Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014).
- Small-size convolution filters have been previously used by Ciresan et al. (2011), but their nets are significantly less deep than ours, and they did not evaluate on the large-scale ILSVRC dataset.
- GoogLeNet (Szegedy et al., 2014), a top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNets (22 weight layers) and small convolution filters (apart from 3 × 3, they also use 1 × 1 and 5 × 5 convolutions).
- In this section, we present the image classification results achieved by the described ConvNet architectures on the ILSVRC-2012 dataset (which was used for ILSVRC 2012–2014 challenges).
- The former is a multi-class classification error, i.e. the proportion of incorrectly classified images; the latter is the main evaluation criterion used in ILSVRC, and is computed as the proportion of images such that the ground-truth category is outside the top-5 predicted categories (see the top-5 sketch after this entry).
- Certain experiments were also carried out on the test set and submitted to the official ILSVRC server as a “VGG” team entry to the ILSVRC-2014 competition (Russakovsky et al., 2014).
- This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al., 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014).
- By the time of ILSVRC submission we had only trained the single-scale networks, as well as a multi-scale model D (by fine-tuning only the fully-connected layers rather than all layers).
- The resulting ensemble of 7 networks has 7.3% ILSVRC test error.
- In the classification task of the ILSVRC-2014 challenge (Russakovsky et al., 2014), our “VGG” team secured the 2nd place with 7.3% test error using an ensemble of 7 models.
- Table 7: Comparison with the state of the art in ILSVRC classification.
- As can be seen from Table 7, our very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions.
- Our result is also competitive with respect to the classification task winner (GoogLeNet with 6.7% error) and substantially outperforms the ILSVRC-2013 winning submission Clarifai, which achieved 11.2% with outside training data and 11.7% without it.
- This is remarkable, considering that our best result is achieved by combining just two models – significantly less than used in most ILSVRC submissions.
- In the main body of the paper we have considered the classification task of the ILSVRC challenge, and performed a thorough evaluation of ConvNet architectures of different depth.
- For this we adopt the approach of Sermanet et al. (2014), the winners of the ILSVRC-2013 localisation challenge, with a few modifications.
- We trained two localisation models, each on a single scale: S = 256 and S = 384 (due to the time constraints, we did not use training scale jittering for our ILSVRC-2014 submission).
- The localisation error is measured according to the ILSVRC criterion (Russakovsky et al., 2014), i.e. the bounding box prediction is deemed correct if its intersection-over-union ratio with the ground-truth bounding box is above 0.5.
- With 25.3% test error, our “VGG” team won the localisation challenge of ILSVRC-2014 (Russakovsky et al., 2014).
- Notably, our results are considerably better than those of the ILSVRC-2013 winner Overfeat (Sermanet et al., 2014), even though we used fewer scales and did not employ their resolution enhancement technique.
- Table 10: Comparison with the state of the art in ILSVRC localisation.
- In the previous sections we have discussed training and evaluation of very deep ConvNets on the ILSVRC dataset.
- In this section, we evaluate our ConvNets, pre-trained on ILSVRC, as feature extractors on other, smaller, datasets, where training large models from scratch is not feasible due to over-fitting.
- Recently, there has been a lot of interest in such a use case (Zeiler & Fergus, 2013; Donahue et al., 2013; Razavian et al., 2014; Chatfield et al., 2014), as it turns out that deep image representations, learnt on ILSVRC, generalise well to other datasets, where they have outperformed hand-crafted representations by a large margin.
- In this evaluation, we consider two models with the best classification performance on ILSVRC (Sect. 4) – configurations “Net-D” and “Net-E” (which we made publicly available).
- To utilise the ConvNets, pre-trained on ILSVRC, for image classification on other datasets, we remove the last fully-connected layer (which performs 1000-way ILSVRC classification), and use 4096-D activations of the penultimate layer as image features, which are aggregated across multiple locations and scales.
- Aggregation of features is carried out in a similar manner to our ILSVRC evaluation procedure (Sect. 3.2).
- Results marked with * were achieved using ConvNets pre-trained on the extended ILSVRC dataset (2000 classes).
- Our methods set the new state of the art across image representations pre-trained on the ILSVRC dataset, outperforming the previous best result of Chatfield et al. (2014) by more than 6%.
- It should be noted that the method of Wei et al. (2014), which achieves 1% better mAP on VOC-2012, is pre-trained on an extended 2000-class ILSVRC dataset, which includes an additional 1000 categories semantically close to those in the VOC datasets.
- Results marked with * were achieved using ConvNets pre-trained on the extended ILSVRC dataset (1512 classes).
- v1 Presents the experiments carried out before the ILSVRC submission.
- v2 Adds post-submission ILSVRC experiments with training set augmentation using scale jittering, which improves the performance.
|
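The top-5 criterion quoted above (an image counts as an error when its ground-truth category is outside the five highest-scoring predictions) reduces to a few lines. A minimal sketch, assuming `scores` holds one row of class scores per image and `labels` is an integer array:

```python
import numpy as np

def top5_error(scores, labels):
    """ILSVRC-style top-5 error: fraction of images whose ground-truth
    class is not among the five highest-scoring predicted classes.
    scores: (N, C) class scores; labels: (N,) ground-truth indices."""
    top5 = np.argsort(scores, axis=1)[:, -5:]      # five best classes per image
    hit = (top5 == labels[:, None]).any(axis=1)    # ground truth among them?
    return 1.0 - hit.mean()
```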
3 | bounding (20) | [baundɪŋ] |
- It can be seen as a special case of object detection, where a single object bounding box should be predicted for each of the top-5 classes, irrespective of the actual number of objects of the class.
- To perform object localisation, we use a very deep ConvNet, where the last fully connected layer predicts the bounding box location instead of the class scores.
- A bounding box is represented by a 4-D vector storing its center coordinates, width, and height.
- There is a choice of whether the bounding box prediction is shared across all classes (single-class regression, SCR (Sermanet et al., 2014)) or is class-specific (per-class regression, PCR).
- Apart from the last bounding box prediction layer, we use the ConvNet architecture D (Table 1), which contains 16 weight layers and was found to be the best-performing in the classification task (Sect. 4).
- The main difference is that we replace the logistic regression objective with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground-truth.
- The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground-truth class (to factor out the classification errors).
- The bounding box is obtained by applying the network only to the central crop of the image.
- The difference is that instead of the class score map, the output of the last fully-connected layer is a set of bounding box predictions.
- When several localisation ConvNets are used, we first take the union of their sets of bounding box predictions, and then run the merging procedure on the union.
- We did not use the multiple pooling offsets technique of Sermanet et al. (2014), which increases the spatial resolution of the bounding box predictions and can further improve the results.
- The localisation error is measured according to the ILSVRC criterion (Russakovsky et al., 2014), i.e. the bounding box prediction is deemed correct if its intersection-over-union ratio with the ground-truth bounding box is above 0.5 (see the IoU sketch after this entry).
- Table 8: Localisation error for different modifications with the simplified testing protocol: the bounding box is predicted from a single central image crop, and the ground-truth class is used.
- Having determined the best localisation setting (PCR, fine-tuning of all layers), we now apply it in the fully-fledged scenario, where the top-5 class labels are predicted using our best-performing classification system (Sect. 4.5), and multiple densely-computed bounding box predictions are merged using the method of Sermanet et al. (2014).
- We also evaluated our best-performing image representation (the stacking of Net-D and Net-E features) on the PASCAL VOC-2012 action classification task (Everingham et al., 2015), which consists in predicting an action class from a single image, given a bounding box of the person performing the action.
- We considered two training settings: (i) computing the ConvNet features on the whole image and ignoring the provided bounding box; (ii) computing the features on the whole image and on the provided bounding box, and stacking them to obtain the final representation.
- Our representation achieves the state of the art on the VOC action classification task even without using the provided bounding boxes, and the results are further improved when using both images and bounding boxes.
|
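The 4-D (centre, width, height) box encoding and the IoU > 0.5 correctness test quoted above can be sketched directly; this is an assumption-level illustration, not the official evaluation code:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in the 4-D (cx, cy, w, h)
    encoding described above."""
    ax0, ay0 = a[0] - a[2] / 2, a[1] - a[3] / 2    # box a corners
    ax1, ay1 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx0, by0 = b[0] - b[2] / 2, b[1] - b[3] / 2    # box b corners
    bx1, by1 = b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))   # overlap width
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))   # overlap height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

# ILSVRC localisation criterion: a prediction is correct if IoU > 0.5.
# correct = iou(pred_box, gt_box) > 0.5
```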
4 | Krizhevsky (17) | |
- With ConvNets becoming more of a commodity in the computer vision field, a number of attempts have been made to improve the original architecture of Krizhevsky et al. (2012) in a bid to achieve better accuracy.
- To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011); Krizhevsky et al. (2012).
- All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity.
- We note that none of our networks (except for one) contain Local Response Normalisation (LRN) (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.
- Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al., 2012).
- Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014).
- Rather than using relatively large receptive fields in the first conv. layers (e.g. 11 × 11 with stride 4 in (Krizhevsky et al., 2012), or 7 × 7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1).
- The ConvNet training procedure generally follows Krizhevsky et al. (2012) (except for sampling the input crops from multi-scale training images, as explained later).
- We conjecture that in spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al., 2012), the nets required fewer epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers.
- To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift (Krizhevsky et al., 2012) (see the augmentation sketch after this entry).
- In our experiments, we evaluated models trained at two fixed scales: S = 256 (which has been widely used in the prior art (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014)) and S = 384.
- Since the fully-convolutional network is applied over the whole image, there is no need to sample multiple crops at test time (Krizhevsky et al., 2012), which is less efficient as it requires network re-computation for each crop.
- While more sophisticated methods of speeding up ConvNet training have been recently proposed (Krizhevsky, 2014), which employ model and data parallelism for different layers of the net, we have found that our conceptually much simpler scheme already provides a speedup of 3.75 times on an off-the-shelf 4-GPU system, as compared to using a single GPU.
- This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al., 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014).
- It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) with substantially increased depth.
- For instance, Girshick et al. (2014) achieve the state-of-the-art object detection results by replacing the ConvNet of Krizhevsky et al. (2012) with our 16-layer model.
- Similar gains over the more shallow architecture of Krizhevsky et al. (2012) have been observed in semantic segmentation (Long et al., 2014), image caption generation (Kiros et al., 2014; Karpathy & Fei-Fei, 2014), and texture and material recognition (Cimpoi et al., 2014; Bell et al., 2014).
|
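The random horizontal flipping and RGB colour shift of Krizhevsky et al. (2012) quoted above amount to roughly the following sketch. The PCA convention and `sigma=0.1` are assumptions; `eigvals`/`eigvecs` are assumed precomputed from the RGB values of the training set:

```python
import numpy as np

def augment(img, eigvals, eigvecs, sigma=0.1):
    """Random horizontal flip plus a random RGB colour shift: add random
    multiples of the principal components of the training set's RGB values.
    img: HxWx3 float array; eigvals: (3,); eigvecs: (3, 3)."""
    if np.random.rand() < 0.5:
        img = img[:, ::-1, :]                       # horizontal flip
    alpha = np.random.normal(0.0, sigma, size=3)    # per-image random weights
    shift = eigvecs @ (alpha * eigvals)             # 3-D offset for every pixel
    return img + shift[None, None, :]
```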
5 | Sermanet (17) | |
- For instance, the best-performing submissions to the ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer.
- Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales (Sermanet et al., 2014; Howard, 2014).
- In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).
- Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014).
- Rather than using relatively large receptive fields in the first conv. layers (e.g. 11 × 11 with stride 4 in (Krizhevsky et al., 2012), or 7 × 7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1).
- In our experiments, we evaluated models trained at two fixed scales: S = 256 (which has been widely used in the prior art (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014)) and S = 384.
- Then, the network is applied densely over the rescaled test image in a way similar to (Sermanet et al., 2014).
- This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al., 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014).
- For this we adopt the approach of Sermanet et al. (2014), the winners of the ILSVRC-2013 localisation challenge, with a few modifications.
- There is a choice of whether the bounding box prediction is shared across all classes (single-class regression, SCR (Sermanet et al., 2014)) or is class-specific (per-class regression, PCR).
- We explored both fine-tuning all layers and fine-tuning only the first two fully-connected layers, as done in (Sermanet et al., 2014).
- To come up with the final prediction, we utilise the greedy merging procedure of Sermanet et al. (2014), which first merges spatially close predictions (by averaging their coordinates), and then rates them based on the class scores, obtained from the classification ConvNet (see the merging sketch after this entry).
- We did not use the multiple pooling offsets technique of Sermanet et al. (2014), which increases the spatial resolution of the bounding box predictions and can further improve the results.
- Settings comparison. As can be seen from Table 8, per-class regression (PCR) outperforms the class-agnostic single-class regression (SCR), which differs from the findings of Sermanet et al. (2014), where PCR was outperformed by SCR.
- We also note that fine-tuning all layers for the localisation task leads to noticeably better results than fine-tuning only the fully-connected layers (as done in (Sermanet et al., 2014)).
- Having determined the best localisation setting (PCR, fine-tuning of all layers), we now apply it in the fully-fledged scenario, where the top-5 class labels are predicted using our best-performing classification system (Sect. 4.5), and multiple densely-computed bounding box predictions are merged using the method of Sermanet et al. (2014).
- Notably, our results are considerably better than those of the ILSVRC-2013 winner Overfeat (Sermanet et al., 2014), even though we used fewer scales and did not employ their resolution enhancement technique.
|
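The greedy merging procedure of Sermanet et al. (2014) summarised above (merge spatially close predictions by averaging their coordinates, then rate the merged boxes by class scores) might look roughly as follows. The closeness test is an assumption, since the quoted sentence does not specify one:

```python
import math
import numpy as np

def close(a, b, tol=0.5):
    """Assumed 'spatially close' test: centre distance below a fraction
    of the boxes' average (cx, cy, w, h) size."""
    d = math.hypot(a[0] - b[0], a[1] - b[1])
    return d < tol * 0.25 * (a[2] + a[3] + b[2] + b[3])

def greedy_merge(boxes, scores):
    """Greedily group close predictions, average each group's coordinates,
    and rate the merged box by the summed class scores of its members."""
    boxes, scores = list(boxes), list(scores)
    merged = []
    while boxes:
        seed, seed_s = boxes.pop(0), scores.pop(0)
        group, group_s, rest_b, rest_s = [seed], [seed_s], [], []
        for b, s in zip(boxes, scores):
            if close(b, seed):
                group.append(b); group_s.append(s)
            else:
                rest_b.append(b); rest_s.append(s)
        boxes, scores = rest_b, rest_s
        merged.append((np.mean(group, axis=0), sum(group_s)))
    return sorted(merged, key=lambda m: -m[1])      # highest-rated first
```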
6 | Caltech (15) | ['kæltek] |
- Table 11: Comparison with the state of the art in image classification on VOC-2007, VOC-2012, Caltech-101, and Caltech-256.
- Image Classification on Caltech-101 and Caltech-256.
- In this section we evaluate very deep features on Caltech-101 (Fei-Fei et al., 2004) and Caltech-256 (Griffin et al., 2007) image classification benchmarks.
- Caltech-101 contains 9K images labelled into 102 classes (101 object categories and a background class), while Caltech-256 is larger with 31K images and 257 classes.
- Following Chatfield et al. (2014); Zeiler & Fergus (2013); He et al. (2014), on Caltech-101 we generated 3 random splits into training and test data, so that each split contains 30 training images per class, and up to 50 test images per class (see the split sketch after this entry).
- On Caltech-256 we also generated 3 splits, each of which contains 60 training images per class (and the rest is used for testing).
- We found that unlike VOC, on Caltech datasets the stacking of descriptors, computed over multiple scales, performs better than averaging or max-pooling.
- This can be explained by the fact that in Caltech images objects typically occupy the whole image, so multi-scale image features are semantically different (capturing the whole object vs. object parts), and stacking allows a classifier to exploit such scale-specific representations.
- On Caltech-101, our representations are competitive with the approach of He et al. (2014), which, however, performs significantly worse than our nets on VOC-2007.
- On Caltech-256, our features outperform the state of the art (Chatfield et al., 2014) by a large margin (8.6%).
- v3 Adds generalisation experiments (Appendix B) on PASCAL VOC and Caltech image classification datasets.
|
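The split protocol quoted above (3 random splits, 30 training images per class, up to 50 test images per class) is straightforward to reproduce in outline. A sketch under the assumption that `labels` lists one class index per image:

```python
import random
from collections import defaultdict

def make_split(labels, n_train=30, n_test=50, seed=0):
    """One Caltech-101-style random split: per class, n_train training
    indices and up to n_test test indices into the dataset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        train += idxs[:n_train]
        test += idxs[n_train:n_train + n_test]      # 'up to 50' test images
    return train, test

# splits = [make_split(labels, seed=s) for s in range(3)]   # 3 random splits
```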
7 | submission (12) | [səbˈmɪʃn] |
- These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
- For instance, the best-performing submissions to the ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer.
- It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).
- This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al., 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014).
- By the time of ILSVRC submission we had only trained the single-scale networks, as well as a multi-scale model D (by fine-tuning only the fully-connected layers rather than all layers).
- After the submission, we considered an ensemble of only two best-performing multi-scale models (configurations D and E), which reduced the test error to 7.0% using dense evaluation and 6.8% using combined dense and multi-crop evaluation.
- After the submission, we decreased the error rate to 6.8% using an ensemble of 2 models.
- Our result is also competitive with respect to the classification task winner (GoogLeNet with 6.7% error) and substantially outperforms the ILSVRC-2013 winning submission Clarifai, which achieved 11.2% with outside training data and 11.7% without it.
- This is remarkable, considering that our best result is achieved by combining just two models – significantly less than used in most ILSVRC submissions.
- We trained two localisation models, each on a single scale: S = 256 and S = 384 (due to the time constraints, we did not use training scale jittering for our ILSVRC-2014 submission).
- v1 Presents the experiments carried out before the ILSVRC submission.
- v4 The paper is converted to ICLR-2015 submission format.
|
8 | receptive (11) | [rɪˈseptɪv] |
- For instance, the best-performing submissions to the ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer.
- The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3 × 3 (which is the smallest size to capture the notion of left/right, up/down, center).
- In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).
- Rather than using relatively large receptive fields in the first conv. layers (e.g. 11 × 11 with stride 4 in (Krizhevsky et al., 2012), or 7 × 7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1).
- It is easy to see that a stack of two 3 × 3 conv. layers (without spatial pooling in between) has an effective receptive field of 5 × 5; three such layers have a 7 × 7 effective receptive field (see the receptive-field sketch after this entry).
- The incorporation of 1 × 1 conv. layers (configuration C, Table 1) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers.
- Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured.
- The configuration C (which contains 1 × 1 conv. layers) performs worse than the configuration D, which uses 3 × 3 conv. layers throughout the network. This indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C).
- We also evaluated a shallow net derived from B by replacing each pair of 3 × 3 conv. layers with a single 5 × 5 conv. layer (which has the same receptive field as explained in Sect. 2.3). The top-1 error of the shallow net was measured to be 7% higher than that of B (on a center crop), which confirms that a deep net with small filters outperforms a shallow net with larger filters.
|
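The receptive-field arithmetic quoted above (two 3 × 3 layers see 5 × 5, three see 7 × 7) follows from each stride-1 k × k layer growing the field by k - 1. A one-function check:

```python
def effective_rf(kernel_sizes):
    """Effective receptive field of a stack of stride-1 conv. layers:
    start at 1 pixel and grow by (k - 1) per k x k layer."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

assert effective_rf([3, 3]) == 5       # two 3x3 layers -> 5x5, as quoted above
assert effective_rf([3, 3, 3]) == 7    # three 3x3 layers -> 7x7
```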
9 | descriptor (10) | [dɪˈskrɪptə(r)] |
- The resulting image descriptor is L2-normalised and combined with a linear SVM classifier, trained on the target dataset.
- We then perform global average pooling on the resulting feature map, which produces a 4096-D image descriptor.
- The descriptor is then averaged with the descriptor of a horizontally flipped image.
- Stacking allows a subsequent classifier to learn how to optimally combine image statistics over a range of scales; this, however, comes at the cost of the increased descriptor dimensionality.
- We also assess late fusion of features, computed using two networks, which is performed by stacking their respective image descriptors.
- Notably, by examining the performance on the validation sets of VOC-2007 and VOC-2012, we found that aggregating image descriptors, computed at multiple scales, by averaging performs similarly to the aggregation by stacking (see the aggregation sketch after this entry).
- Since averaging has a benefit of not inflating the descriptor dimensionality, we were able to aggregate image descriptors over a wide range of scales: $Q \in \{256, 384, 512, 640, 768\}$.
- We found that unlike VOC, on Caltech datasets the stacking of descriptors, computed over multiple scales, performs better than averaging or max-pooling.
|
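The descriptor handling quoted above (L2 normalisation, flip averaging, and multi-scale aggregation by averaging or stacking) can be sketched as follows; the 4096-D inputs are assumed to come from a feature extractor like the one in the ConvNet entry:

```python
import numpy as np

def l2_normalise(d):
    """L2-normalise an image descriptor."""
    return d / (np.linalg.norm(d) + 1e-12)

def aggregate(per_scale, how="average"):
    """Combine per-scale 4096-D descriptors. Averaging keeps the
    dimensionality fixed; stacking preserves scale-specific statistics
    at the cost of a larger descriptor."""
    if how == "average":
        return l2_normalise(np.mean(per_scale, axis=0))
    return l2_normalise(np.concatenate(per_scale))  # 'stacking'

# Flip averaging: desc = 0.5 * (features(img) + features(hflip(img)))
```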
10 | jittering (9) | [ˈdʒitərɪŋ] |
- This can also be seen as training set augmentation by scale jittering, where a single model is trained to recognise objects over a wide range of scales.
- Finally, scale jittering at training time ($S \in [256; 512]$) leads to significantly better results than training on images with fixed smallest side (S = 256 or S = 384), even though a single scale is used at test time (see the jittering sketch after this entry).
- This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.
- Having evaluated the ConvNet models at a single scale, we now assess the effect of scale jittering at test time.
- At the same time, scale jittering at training time allows the network to be applied to a wider range of scales at test time, so the model trained with variable $S \in [S_{min}; S_{max}]$ was evaluated over a larger range of sizes $Q = \{S_{min}, 0.5(S_{min} + S_{max}), S_{max}\}$.
- The results, presented in Table 4, indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale, shown in Table 3).
- As before, the deepest configurations (D and E) perform the best, and scale jittering is better than training with a fixed smallest side S.
- We trained two localisation models, each on a single scale: S = 256 and S = 384 (due to the time constraints, we did not use training scale jittering for our ILSVRC-2014 submission).
- v2 Adds post-submission ILSVRC experiments with training set augmentation using scale jittering, which improves the performance.
|
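The multi-scale training procedure quoted above (sample S in [256; 512], isotropically rescale, then take one random 224 × 224 crop per SGD iteration) corresponds to roughly the following sketch, using PIL for illustration:

```python
import random
from PIL import Image

def jittered_crop(img, s_min=256, s_max=512, crop=224):
    """One scale-jittered training sample: rescale so the smallest side
    equals a random S in [s_min, s_max], then take one random crop."""
    s = random.randint(s_min, s_max)                # training scale S
    w, h = img.size
    scale = s / min(w, h)                           # isotropic rescaling
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    x = random.randint(0, img.width - crop)
    y = random.randint(0, img.height - crop)
    return img.crop((x, y, x + crop, y + crop))
```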
11 | Zeiler (7) | |
- For instance, the best-performing submissions to the ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer.
- Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014).
- Rather than using relatively large receptive fields in the first conv. layers (e.g. 11 × 11 with stride 4 in (Krizhevsky et al., 2012), or 7 × 7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1).
- In our experiments, we evaluated models trained at two fixed scales: S = 256 (which has been widely used in the prior art (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014)) and S = 384.
- This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al., 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014).
- Recently, there has been a lot of interest in such a use case (Zeiler & Fergus, 2013; Donahue et al., 2013; Razavian et al., 2014; Chatfield et al., 2014), as it turns out that deep image representations, learnt on ILSVRC, generalise well to other datasets, where they have outperformed hand-crafted representations by a large margin.
- Following Chatfield et al. (2014); Zeiler & Fergus (2013); He et al. (2014), on Caltech-101 we generated 3 random splits into training and test data, so that each split contains 30 training images per class, and up to 50 test images per class.
|
12 | Fergus (7) | ['fә:gәs] |
- For instance, the best-performing submissions to the ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al. , 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer.例如,ILSVRC-2013(Zeiler&Fergus,2013;Sermanet等,2014)表现最佳的提交使用了更小的感受窗口尺寸和更小的第一卷积层步长。
- Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al. , 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al. , 2014).我们的ConvNet配置与ILSVRC-2012(Krizhevsky等,2012)和ILSVRC-2013比赛(Zeiler&Fergus,2013;Sermanet等,2014)表现最佳的参赛提交中使用的ConvNet配置有很大不同。
- Rather than using relatively large receptive fields in the first conv. layers (e.g. 11 × 11 with stride 4 in (Krizhevsky et al. , 2012), or 7 × 7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al. , 2014)), we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1).不是在第一卷积层中使用相对较大的感受野(例如,在(Krizhevsky等人,2012)中的11×11,步长为4,或在(Zeiler&Fergus,2013;Sermanet等,2014)中的7×7,步长为2),我们在整个网络使用非常小的3×3感受野,与输入的每个像素(步长为1)进行卷积。
- In our experiments, we evaluated models trained at two fixed scales: S = 256 (which has been widely used in the prior art (Krizhevsky et al. , 2012; Zeiler & Fergus, 2013; Sermanet et al. , 2014)) and S = 384.在我们的实验中,我们评估了以两个固定尺度训练的模型:S = 256(已经在现有技术中广泛使用(Krizhevsky等人,2012;Zeiler&Fergus,2013;Sermanet等,2014))和S = 384。
- This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al. , 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al. , 2014).由于模型的互补性,这提高了性能,并且在2012年(Krizhevsky等,2012)和2013年(Zeiler&Fergus,2013;Sermanet等,2014)的ILSVRC顶级提交中被使用。
- Recently, there has been a lot of interest in such a use case (Zeiler & Fergus, 2013; Donahue et al. , 2013; Razavian et al. , 2014; Chatfield et al. , 2014), as it turns out that deep image representations, learnt on ILSVRC, generalise well to other datasets, where they have outperformed hand-crafted representations by a large margin.最近,人们对这样一个用例非常感兴趣(Zeiler&Fergus,2013;Donahue等人,2013;Razavian等人,2014;Chatfield等人,2014),因为事实证明,在ILSVRC上学习的深度图像表示可以很好地推广到其他数据集,并大幅超过了手工设计的表示。
- Following Chatfield et al. (2014); Zeiler & Fergus (2013); He et al. (2014), on Caltech-101 we generated 3 random splits into training and test data, so that each split contains 30 training images per class, and up to 50 test images per class.遵循Chatfield等人(2014);Zeiler&Fergus(2013);He等人(2014),在Caltech-101上,我们生成了3个随机的训练/测试数据划分,每个划分包含每类30个训练图像和每类最多50个测试图像。
|
13 | rescale (7) | [ri:'skeɪl] |
- To obtain the fixed-size 224×224 ConvNet input images, they were randomly cropped from rescaled training images (one crop per image per SGD iteration).为了获得固定大小的224×224 ConvNet输入图像,它们是从重新缩放的训练图像中随机裁剪的(每个图像每次SGD迭代进行一次裁剪)。
- Training image rescaling is explained below.下面解释训练图像的重新缩放。
- The second approach to setting S is multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range $[S_{min},S_{max}]$ (we used $S_{min} = 256$ and $S_{max} = 512$).设置S的第二种方法是多尺度训练,其中每个训练图像通过从一定范围$[S_{min},S_{max}]$(我们使用$S_{min} = 256$和$S_{max} = 512$)随机采样S来单独进行重新缩放。
- First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q (we also refer to it as the test scale).首先,将其等比例地重新缩放到预定义的最小图像边,表示为Q(我们也将其称为测试尺度)。
- Then, the network is applied densely over the rescaled test image in a way similar to (Sermanet et al. , 2014).然后,网络以类似于(Sermanet等人,2014)的方式密集地应用于重新缩放后的测试图像上。
- It consists of running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors.它包括在一张测试图像的几个重新缩放版本上运行模型(对应于不同的Q值),然后对所得到的类别后验进行平均。
- Namely, an image is first rescaled so that its smallest side equals Q, and then the network is densely applied over the image plane (which is possible when all weight layers are treated as convolutional).即,首先重新缩放图像,使其最小边等于Q,然后在图像平面上密集地应用网络(当所有权重层都被视为卷积层时,这是可行的)。
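The rescaling and cropping described above can be sketched in a few lines of Python; this is an illustrative reconstruction using Pillow, not the authors' Caffe-based implementation:

```python
import random
from PIL import Image

def isotropic_rescale(img, smallest_side):
    """Rescale so the smallest image side equals `smallest_side`,
    preserving the aspect ratio."""
    w, h = img.size
    scale = smallest_side / min(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)

def random_train_crop(img, s_min=256, s_max=512, crop=224):
    """Multi-scale training: sample S uniformly from [S_min, S_max],
    rescale, then take one random 224x224 crop per SGD iteration."""
    s = random.randint(s_min, s_max)
    img = isotropic_rescale(img, s)
    w, h = img.size
    x, y = random.randint(0, w - crop), random.randint(0, h - crop)
    return img.crop((x, y, x + crop, y + crop))
```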
|
14 | utilise (6) | ['ju:tɪlaɪz] |
- For instance, the best-performing submissions to the ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al. , 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer.例如,ILSVRC-2013(Zeiler&Fergus,2013;Sermanet等,2014)表现最佳的提交使用了更小的感受窗口尺寸和更小的第一卷积层步长。
- In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity).在其中一种配置中,我们还使用了1×1卷积滤波器,可以看作输入通道的线性变换(后面是非线性)。
- It should be noted that 1 × 1 conv. layers have recently been utilised in the “Network in Network” architecture of Lin et al. (2014).应该注意的是1×1卷积层最近在Lin等人(2014)的“Network in Network”架构中已经得到了使用。
- To come up with the final prediction, we utilise the greedy merging procedure of Sermanet et al. (2014), which first merges spatially close predictions (by averaging their coordinates), and then rates them based on the class scores, obtained from the classification ConvNet.为了得到最终的预测,我们利用Sermanet等人(2014)的贪婪合并过程,它首先合并空间上接近的预测(通过平均它们的坐标),然后基于从分类ConvNet获得的类别得分对它们进行评级。
- Following that line of work, we investigate if our models lead to better performance than more shallow models utilised in the state-of-the-art methods.遵循这一工作路线,我们研究我们的模型是否比现有最佳方法中使用的更浅的模型具有更好的性能。
- To utilise the ConvNets, pre-trained on ILSVRC, for image classification on other datasets, we remove the last fully-connected layer (which performs 1000-way ILSVRC classification), and use 4096-D activations of the penultimate layer as image features, which are aggregated across multiple locations and scales.为了将在ILSVRC上预训练的ConvNets用于其他数据集的图像分类,我们移除了最后一个全连接层(它执行1000类ILSVRC分类),并使用倒数第二层的4096维激活作为图像特征,这些特征在多个位置和尺度上进行聚合。
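The last snippet above describes the transfer pipeline: drop the 1000-way classifier and keep the 4096-D penultimate activations. A minimal sketch using recent torchvision's VGG-16 as a convenient stand-in for the pre-trained nets (the weight enum and the random input are assumptions for illustration):

```python
import torch
import torchvision.models as models

# Drop the final 1000-way FC layer; keep the 4096-D penultimate output.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

with torch.no_grad():
    batch = torch.randn(1, 3, 224, 224)   # placeholder for a real image
    features = vgg(batch)                 # shape: (1, 4096)
```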
|
15 | non-linearity (6) | ['nɒnlaɪn'ərɪtɪ] |
- In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity).在其中一种配置中,我们还使用了1×1卷积滤波器,可以看作输入通道的线性变换(后面是非线性)。
- All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al. , 2012)) non-linearity.所有隐藏层都配备了修正(ReLU(Krizhevsky等,2012))非线性。
- This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).这可以看作是对7×7卷积滤波器进行正则化,迫使它们通过3×3滤波器(在它们之间注入非线性)进行分解。
- The incorporation of 1 × 1 conv. layers (configuration C, Table 1) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers.结合1×1卷积层(配置C,表1)是在不影响卷积层感受野的情况下增加决策函数非线性的一种方式。
- Even though in our case the 1 × 1 convolution is essentially a linear projection onto the space of the same dimensionality (the number of input and output channels is the same), an additional non-linearity is introduced by the rectification function.即使在我们的情况下,1×1卷积本质上是到相同维度空间上的线性投影(输入和输出通道的数量相同),修正函数仍引入了附加的非线性。
- This indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C).这表明,虽然额外的非线性确实有帮助(C优于B),但通过使用具有非平凡感受野的卷积滤波器来捕获空间上下文也同样重要(D优于C)。
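The 1 × 1 convolution described above is a per-pixel linear projection of the channels; the rectification after it is what injects the extra non-linearity. A minimal PyTorch sketch (the channel count is chosen arbitrarily):

```python
import torch
import torch.nn as nn

channels = 256  # configuration C keeps input and output channels equal
block = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=1, stride=1, padding=0),
    nn.ReLU(inplace=True),  # the rectification adds the non-linearity
)
x = torch.randn(1, channels, 28, 28)
assert block(x).shape == x.shape  # spatial size and channels unchanged
```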
|
16 | normalisation (6) | [,nɔ:məlai'zeiʃən] |
- We note that none of our networks (except for one) contain Local Response Normalisation (LRN) normalisation (Krizhevsky et al. , 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.我们注意到,我们的网络(除了一个)都不包含局部响应规范化(LRN)(Krizhevsky等,2012):将在第4节看到,这种规范化并不能提高在ILSVRC数据集上的性能,但增加了内存消耗和计算时间。
- First, we note that using local response normalisation (A-LRN network) does not improve on the model A without any normalisation layers.首先,我们注意到,使用局部响应归一化的A-LRN网络相对于不含任何归一化层的模型A没有改善。
- We thus do not employ normalisation in the deeper architectures (B–E).因此,我们在较深的架构(B-E)中不采用归一化。
|
17 | flip (5) | [flɪp] |
- To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift (Krizhevsky et al. , 2012).为了进一步增强训练集,裁剪图像经过了随机水平翻转和随机RGB颜色偏移(Krizhevsky等,2012)。
- We also augment the test set by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores for the image.我们还通过水平翻转图像来增强测试集;将原始图像和翻转图像的soft-max类后验进行平均,以获得图像的最终分数。
- While we believe that in practice the increased computation time of multiple crops does not justify the potential gains in accuracy, for reference we also evaluate our networks using 50 crops per scale (5 × 5 regular grid with 2 flips), for a total of 150 crops over 3 scales, which is comparable to 144 crops over 4 scales used by Szegedy et al. (2014).虽然我们认为在实践中,多裁剪图像增加的计算时间并不足以抵消潜在的准确性收益,但作为参考,我们还在每个尺度使用50个裁剪图像(5×5规则网格,2次翻转)评估了我们的网络,即在3个尺度上总共150个裁剪图像,这与Szegedy等人(2014)在4个尺度上使用的144个裁剪图像相当。
- The descriptor is then averaged with the descriptor of a horizontally flipped image.然后将描述符与水平翻转图像的描述符进行平均。
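Test-time flip augmentation as described above reduces to a two-line average; a sketch assuming a hypothetical `predict_fn` that maps an H × W × 3 array to a soft-max posterior vector:

```python
def flip_averaged_posteriors(predict_fn, image):
    """Average the posteriors of an image and its horizontal flip."""
    flipped = image[:, ::-1, :]   # reverse the width axis
    return 0.5 * (predict_fn(image) + predict_fn(flipped))
```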
|
18 | substantially (5) | [səbˈstænʃəli] |
- Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured.此外,由于不同的卷积边界条件,多裁剪图像评估是密集评估的补充:当将ConvNet应用于裁剪图像时,卷积特征图用零填充,而在密集评估的情况下,相同裁剪图像的填充自然会来自于图像的相邻部分(由于卷积和空间池化),这大大增加了整个网络的感受野,因此捕获了更多的上下文。
- Our result is also competitive with respect to the classification task winner (GoogLeNet with 6.7% error) and substantially outperforms the ILSVRC-2013 winning submission Clarifai, which achieved 11.2% with outside training data and 11.7% without it.我们的结果对于分类任务获胜者(GoogLeNet具有6.7%的错误率)也具有竞争力,并且大大优于ILSVRC-2013获胜者Clarifai的提交,其使用外部训练数据取得了11.2%的错误率,没有外部数据则为11.7%。
- Notably, we did not depart from the classical ConvNet architecture of LeCun et al. (1989), but improved it by substantially increasing the depth.值得注意的是,我们并没有偏离LeCun等人(1989)的经典ConvNet架构,而是通过大幅增加深度对其进行了改进。
- It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al. , 1989; Krizhevsky et al. , 2012) with substantially increased depth.已经证明,表示深度有利于分类精度,并且深度大大增加的传统ConvNet架构(LeCun等,1989;Krizhevsky等,2012)可以实现ImageNet挑战数据集上的最佳性能。
- As can be seen from Table 9, application of the localisation ConvNet to the whole image substantially improves the results compared to using a center crop (Table 8), despite using the top-5 predicted class labels instead of the ground truth.从表9可以看出,与使用中心裁剪(表8)相比,将定位ConvNet应用于整个图像显著改善了结果,尽管使用的是前5个预测类别标签而不是真实类别。
|
19 | PCR (5) | [!≈ pi: si: ɑ:(r)] |
- There is a choice of whether the bounding box prediction is shared across all classes (single-class regression, SCR (Sermanet et al. , 2014)) or is class-specific (per-class regression, PCR).可以选择边界框预测是在所有类别间共享(单类回归,SCR(Sermanet等人,2014))还是特定于类别(逐类回归,PCR)。
- Settings comparison. As can be seen from Table 8, per-class regression (PCR) outperforms the class-agnostic single-class regression (SCR), which differs from the findings of Sermanet et al. (2014), where PCR was outperformed by SCR.设置比较。从表8可以看出,逐类回归(PCR)优于类不可知的单类回归(SCR),这与Sermanet等人(2014)的发现不同,后者的PCR表现优于SCR。
- All ConvNet layers (except for the last one) have the configuration D (Table 1), while the last layer performs either single-class regression (SCR) or per-class regression (PCR).所有ConvNet层(最后一层除外)都使用配置D(表1),而最后一层执行单类回归(SCR)或逐类回归(PCR)。
- Having determined the best localisation setting (PCR, fine-tuning of all layers), we now apply it in the fully-fledged scenario, where the top-5 class labels are predicted using our best-performing classification system (Sect. 4.5), and multiple densely-computed bounding box predictions are merged using the method of Sermanet et al. (2014).在确定了最佳定位设置(PCR,微调所有层)之后,我们现在将其应用于全面的场景中:使用我们性能最佳的分类系统(第4.5节)预测前5个类别标签,并使用Sermanet等人(2014)的方法合并多个密集计算的边界框预测。
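The SCR/PCR distinction above only changes the width of the last layer: 4 bounding-box parameters shared by all classes versus 4 per class. A sketch of the two heads in PyTorch (the 4096-D input matches the penultimate FC layer):

```python
import torch.nn as nn

num_classes, box_params = 1000, 4
scr_head = nn.Linear(4096, box_params)                # one shared box
pcr_head = nn.Linear(4096, box_params * num_classes)  # one box per class
```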
|
20 | i.e. (4) | [ˌaɪ ˈi:] |
- The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3 × 3 conv.卷积步长固定为1个像素;卷积层输入的空间填充要满足卷积之后保留空间分辨率,即3×3卷积层的填充为1个像素。
- Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack has C channels, the stack is parametrised by $3(3^2C^2)=27C^2$ weights; at the same time, a single 7 × 7 conv. layer would require $7^2C^2=49C^2$ parameters, i.e. 81% more.其次,我们减少参数的数量:假设三层3×3卷积堆叠的输入和输出有C个通道,堆叠卷积层的参数为$3(3^2C^2)=27C^2$个权重;同时,单个7×7卷积层将需要$7^2C^2=49C^2$个参数,即参数多81%。
- The former is a multi-class classification error, i.e. the proportion of incorrectly classified images; the latter is the main evaluation criterion used in ILSVRC, and is computed as the proportion of images such that the ground-truth category is outside the top-5 predicted categories.前者是多类分类误差,即不正确分类图像的比例;后者是ILSVRC中使用的主要评估标准,并且计算为图像真实类别在前5个预测类别之外的图像比例。
- The localisation error is measured according to the ILSVRC criterion (Russakovsky et al. , 2014), i.e. the bounding box prediction is deemed correct if its intersection over union ratio with the ground-truth bounding box is above 0.5.根据ILSVRC标准(Russakovsky等人,2014)测量定位误差,即如果边界框预测与真实边界框的交并比大于0.5,则认为该预测是正确的。
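The parameter comparison quoted above is easy to verify numerically; a worked check in Python for an arbitrary channel count:

```python
# For C input/output channels: a three-layer 3x3 stack has
# 3 * (3^2 * C^2) = 27 C^2 weights, a single 7x7 layer 7^2 * C^2 = 49 C^2.
C = 512
stack = 3 * (3 ** 2 * C ** 2)   # 7,077,888 weights
single = 7 ** 2 * C ** 2        # 12,845,056 weights
print(single / stack - 1)       # ~0.815, i.e. the quoted "81% more"
```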
|
21 | Szegedy (4) | |
- GoogLeNet (Szegedy et al. , 2014), a top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNets(22 weight layers) and small convolution filters (apart from 3 × 3, they also use 1 × 1 and 5 × 5 convolutions).GoogLeNet(Szegedy等,2014)是ILSVRC-2014分类任务中表现最好的参赛项目,它是独立于我们的工作开发的,但相似之处在于它也基于非常深的ConvNets(22个权重层)和小卷积滤波器(除了3×3,它们还使用了1×1和5×5卷积)。
- As will be shown in Sect. 4.5, our model is outperforming that of Szegedy et al. (2014) in terms of the single-network classification accuracy.正如将在第4.5节显示的那样,我们的模型在单网络分类精度方面胜过Szegedy等人(2014)。
- At the same time, using a large set of crops, as done by Szegedy et al. (2014), can lead to improved accuracy, as it results in a finer sampling of the input image compared to the fully-convolutional net.同时,如Szegedy等人(2014)所做的那样,使用大量的裁剪图像可以提高准确度,因为与全卷积网络相比,它使输入图像的采样更精细。
- While we believe that in practice the increased computation time of multiple crops does not justify the potential gains in accuracy, for reference we also evaluate our networks using 50 crops per scale (5 × 5 regular grid with 2 flips), for a total of 150 crops over 3 scales, which is comparable to 144 crops over 4 scales used by Szegedy et al. (2014).虽然我们认为在实践中,多裁剪图像增加的计算时间并不足以抵消潜在的准确性收益,但作为参考,我们还在每个尺度使用50个裁剪图像(5×5规则网格,2次翻转)评估了我们的网络,即在3个尺度上总共150个裁剪图像,这与Szegedy等人(2014)在4个尺度上使用的144个裁剪图像相当。
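The 50-crops-per-scale scheme quoted above (5 × 5 regular grid, each crop also flipped) can be sketched directly; `image` is assumed to be an H × W × 3 numpy array no smaller than the crop:

```python
import numpy as np

def grid_crops(image, crop=224, grid=5):
    """A 5x5 regular grid of crops plus horizontal flips: 50 per scale."""
    h, w, _ = image.shape
    ys = np.linspace(0, h - crop, grid).astype(int)
    xs = np.linspace(0, w - crop, grid).astype(int)
    crops = []
    for y in ys:
        for x in xs:
            patch = image[y:y + crop, x:x + crop]
            crops.extend([patch, patch[:, ::-1]])   # original + flip
    return crops   # 5 * 5 * 2 = 50 crops
```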
|
22 | Russakovsky (4) | |
- Certain experiments were also carried out on the test set and submitted to the official ILSVRC server as a “VGG” team entry to the ILSVRC-2014 competition (Russakovsky et al. , 2014).在测试集上也进行了一些实验,并作为“VGG”团队参加ILSVRC-2014竞赛(Russakovsky等,2014)的参赛提交,提交到了官方的ILSVRC服务器。
- In the classification task of ILSVRC-2014 challenge (Russakovsky et al. , 2014), our “VGG” team secured the 2nd place with 7.3% test error using an ensemble of 7 models.在ILSVRC-2014挑战的分类任务(Russakovsky等,2014)中,我们的“VGG”团队获得了第二名,使用7个模型的组合取得了7.3%测试误差。
- The localisation error is measured according to the ILSVRC criterion (Russakovsky et al. , 2014), i.e. the bounding box prediction is deemed correct if its intersection over union ratio with the ground-truth bounding box is above 0.5.根据ILSVRC标准(Russakovsky等人,2014)测量定位误差,即如果边界框预测与真实边界框的交并比大于0.5,则认为该预测是正确的。
- With 25.3% test error, our “VGG” team won the localisation challenge of ILSVRC-2014 (Russakovsky et al. , 2014).以25.3%的测试误差,我们的“VGG”团队赢得了ILSVRC-2014(Russakovsky等,2014)的定位挑战。
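The ILSVRC localisation criterion quoted above (intersection over union with the ground truth above 0.5) corresponds to the usual IoU computation; a sketch for (x1, y1, x2, y2) boxes:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

correct = iou((10, 10, 110, 110), (30, 30, 130, 130)) > 0.5  # False here
```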
|
23 | ensemble (4) | [ɒnˈsɒmbl] |
- The resulting ensemble of 7 networks has 7.3% ILSVRC test error.由此产生的7个网络组合具有7.3%的ILSVRC测试误差。
- After the submission, we considered an ensemble of only two best-performing multi-scale models (configurations D and E), which reduced the test error to 7.0% using dense evaluation and 6.8% using combined dense and multi-crop evaluation.在提交之后,我们考虑了仅由两个表现最好的多尺度模型(配置D和E)构成的组合,它使用密集评估将测试误差降低到7.0%,使用密集与多裁剪图像相结合的评估将测试误差降低到6.8%。
- In the classification task of ILSVRC-2014 challenge (Russakovsky et al. , 2014), our “VGG” team secured the 2nd place with 7.3% test error using an ensemble of 7 models.在ILSVRC-2014挑战的分类任务(Russakovsky等,2014)中,我们的“VGG”团队获得了第二名,使用7个模型的组合取得了7.3%测试误差。
- After the submission, we decreased the error rate to 6.8% using an ensemble of 2 models.提交后,我们使用2个模型的组合将错误率降低到6.8%。
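Model combination as described above is just an average of soft-max posteriors; a sketch where each element of `models` is assumed to be a callable returning a normalised probability vector:

```python
import numpy as np

def ensemble_posteriors(models, image):
    """Average the soft-max class posteriors of several ConvNets."""
    return np.mean([model(image) for model in models], axis=0)
```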
|
24 | SCR (4) | [!≈ es si: ɑ:(r)] |
- There is a choice of whether the bounding box prediction is shared across all classes (single-class regression, SCR (Sermanet et al. , 2014)) or is class-specific (per-class regression, PCR).可以选择边界框预测是在所有类别间共享(单类回归,SCR(Sermanet等人,2014))还是特定于类别(逐类回归,PCR)。
- Settings comparison. As can be seen from Table 8, per-class regression (PCR) outperforms the class-agnostic single-class regression (SCR), which differs from the findings of Sermanet et al. (2014), where PCR was outperformed by SCR.设置比较。从表8可以看出,逐类回归(PCR)优于类不可知的单类回归(SCR),这与Sermanet等人(2014)的发现不同,后者的PCR表现优于SCR。
- All ConvNet layers (except for the last one) have the configuration D (Table 1), while the last layer performs either single-class regression (SCR) or per-class regression (PCR).所有ConvNet层(最后一层除外)都使用配置D(表1),而最后一层执行单类回归(SCR)或逐类回归(PCR)。
|
25 | fully-fledged (4) | ['fʊli:fl'edʒd] |
- The second, fully-fledged, testing procedure is based on the dense application of the localisation ConvNet to the whole image, similarly to the classification task (Sect. 3.2).第二个全面的测试程序基于将定位ConvNet密集地应用于整个图像,类似于分类任务(第3.2节)。
- In this section we first determine the best-performing localisation setting (using the first test protocol), and then evaluate it in a fully-fledged scenario (the second protocol).在本节中,我们首先确定性能最佳的定位设置(使用第一个测试协议),然后在全面的场景(第二个协议)中对其进行评估。
- Fully-fledged evaluation.全面评估。
- Having determined the best localisation setting (PCR, fine-tuning of all layers), we now apply it in the fully-fledged scenario, where the top-5 class labels are predicted using our best-performing classification system (Sect. 4.5), and multiple densely-computed bounding box predictions are merged using the method of Sermanet et al. (2014).在确定了最佳定位设置(PCR,微调所有层)之后,我们现在将其应用于全面的场景中:使用我们性能最佳的分类系统(第4.5节)预测前5个类别标签,并使用Sermanet等人(2014)的方法合并多个密集计算的边界框预测。
|
26 | Chatfield (4) | |
- Recently, there has been a lot of interest in such a use case (Zeiler & Fergus, 2013; Donahue et al. , 2013; Razavian et al. , 2014; Chatfield et al. , 2014), as it turns out that deep image representations, learnt on ILSVRC, generalise well to other datasets, where they have outperformed hand-crafted representations by a large margin.最近,人们对这样一个用例非常感兴趣(Zeiler&Fergus,2013;Donahue等人,2013;Razavian等人,2014;Chatfield等人,2014),因为事实证明,在ILSVRC上学习的深度图像表示可以很好地推广到其他数据集,并大幅超过了手工设计的表示。
- Our methods set the new state of the art across image representations, pretrained on the ILSVRC dataset, outperforming the previous best result of Chatfield et al. (2014) by more than 6%.我们的方法在基于ILSVRC数据集预训练的图像表示中创造了新的最佳水平,比Chatfield等人(2014)之前的最佳结果高出6%以上。
- Following Chatfield et al. (2014); Zeiler & Fergus (2013); He et al. (2014), on Caltech-101 we generated 3 random splits into training and test data, so that each split contains 30 training images per class, and up to 50 test images per class.遵循Chatfield等人(2014);Zeiler&Fergus(2013);He等人(2014),在Caltech-101上,我们生成了3个随机的训练/测试数据划分,每个划分包含每类30个训练图像和每类最多50个测试图像。
- On Caltech-256, our features outperform the state of the art (Chatfield et al. , 2014) by a large margin (8.6%).在Caltech-256上,我们的特征大幅超过(8.6%)了最先进的方法(Chatfield等人,2014)。
|
27 | Net-D (4) | [!≈ net di:] |
- In this evaluation, we consider two models with the best classification performance on ILSVRC (Sect. 4) – configurations “Net-D” and “Net-E” (which we made publicly available).在这次评估中,我们考虑了在ILSVRC上具有最佳分类性能的两个模型(第4节)——配置“Net-D”和“Net-E”(我们已将它们公开提供)。
- Our networks “Net-D” and “Net-E” exhibit identical performance on VOC datasets, and their combination slightly improves the results.我们的网络“Net-D”和“Net-E”在VOC数据集上表现出相同的性能,并且它们的组合稍微改善了结果。
- As can be seen, the deeper 19-layer Net-E performs better than the 16-layer Net-D, and their combination further improves the performance.可以看出,较深的19层Net-E比16层的Net-D表现得更好,它们的组合进一步提高了性能。
- We also evaluated our best-performing image representation (the stacking of Net-D and Net-E features) on the PASCAL VOC-2012 action classification task (Everingham et al. , 2015), which consists in predicting an action class from a single image, given a bounding box of the person performing the action.我们还在Pascal VOC-2012动作分类任务(Everingham等人,2015)上评估了我们的最佳性能图像表示(Net-D和Net-E特征的叠加),该任务包括从单个图像预测动作类,给定执行者的边界框。
|
28 | Net-E (4) | [!≈ net i:] |
- In this evaluation, we consider two models with the best classification performance on ILSVRC (Sect. 4) – configurations “Net-D” and “Net-E” (which we made publicly available).在这次评估中,我们考虑了在ILSVRC上具有最佳分类性能的两个模型(第4节)——配置“Net-D”和“Net-E”(我们已将它们公开提供)。
- Our networks “Net-D” and “Net-E” exhibit identical performance on VOC datasets, and their combination slightly improves the results.我们的网络“Net-D”和“Net-E”在VOC数据集上表现出相同的性能,并且它们的组合稍微改善了结果。
- As can be seen, the deeper 19-layer Net-E performs better than the 16-layer Net-D, and their combination further improves the performance.可以看出,较深的19层Net-E比16层的Net-D表现得更好,它们的组合进一步提高了性能。
- We also evaluated our best-performing image representation (the stacking of Net-D and Net-E features) on the PASCAL VOC-2012 action classification task (Everingham et al. , 2015), which consists in predicting an action class from a single image, given a bounding box of the person performing the action.我们还在Pascal VOC-2012动作分类任务(Everingham等人,2015)上评估了我们的最佳性能图像表示(Net-D和Net-E特征的叠加),该任务包括从单个图像预测动作类,给定执行者的边界框。
|
29 | PASCAL (4) | ['pæskәl] |
- We begin with the evaluation on the image classification task of PASCAL VOC-2007 and VOC-2012 benchmarks (Everingham et al. , 2015).我们首先评估Pascal VOC-2007和VOC-2012基准的图像分类任务(Everingham等,2015)。
- We also evaluated our best-performing image representation (the stacking of Net-D and Net-E features) on the PASCAL VOC-2012 action classification task (Everingham et al. , 2015), which consists in predicting an action class from a single image, given a bounding box of the person performing the action.我们还在Pascal VOC-2012动作分类任务(Everingham等人,2015)上评估了我们的最佳性能图像表示(Net-D和Net-E特征的叠加),该任务包括从单个图像预测动作类,给定执行者的边界框。
- v3 Adds generalisation experiments (Appendix B) on PASCAL VOC and Caltech image classification datasets.V3增加了对Pascal VOC和Caltech图像分类数据集的泛化实验(附录B)。
- Adds a comparison of the net B with a shallow net and the results on PASCAL VOC action classification benchmark.添加网络B与浅层网络的比较以及Pascal VOC动作分类基准的结果。
|
30 | applicable (3) | [əˈplɪkəbl] |
- As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as a part of a relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning).因此,我们提出了更为精确的ConvNet架构,不仅可以在ILSVRC分类和定位任务上取得的最佳的准确性,而且还适用于其它的图像识别数据集,它们可以获得优异的性能,即使使用相对简单流程的一部分(例如,通过线性SVM分类深度特征而不进行微调)。
- Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al. , 2012).在适用的情况下,LRN层的参数采用(Krizhevsky等,2012)中的参数。
- For random initialisation (where applicable), we sampled the weights from a normal distribution with the zero mean and $10^{−2}$ variance.对于随机初始化(如适用),我们从均值为0、方差为$10^{−2}$的正态分布中采样权重。
|
31 | generalisation (3) | [ˌdʒenərəlaɪ'zeɪʃən] |
- For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B.为了完整起见,我们还将在附录A中描述和评估我们的ILSVRC-2014目标定位系统,并在附录B中讨论了非常深的特征在其它数据集上的泛化。
- B GENERALISATION OF VERY DEEP FEATURESB 非常深的特征的泛化
- v3 Adds generalisation experiments (Appendix B) on PASCAL VOC and Caltech image classification datasets.V3增加了对Pascal VOC和Caltech图像分类数据集的泛化实验(附录B)。
|
32 | revision (3) | [rɪˈvɪʒn] |
- Finally, Appendix C contains the list of major paper revisions.最后,附录C包含了主要的论文修订列表。
- C PAPER REVISIONSC 论文修订
- Here we present the list of major paper revisions, outlining the substantial changes for the convenience of the reader.为了方便读者,我们在这里列出了主要的论文修订列表,概述了重要的变化。
|
33 | rectification (3) | [ˌrektɪfɪ'keɪʃn] |
- All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al. , 2012)) non-linearity.所有隐藏层都配备了修正(ReLU(Krizhevsky等,2012))非线性。
- First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative.首先,我们结合了三个非线性修正层,而不是单一的,这使得决策函数更具判别性。
- layers. Even though in our case the 1 × 1 convolution is essentially a linear projection onto the space of the same dimensionality (the number of input and output channels is the same), an additional non-linearity is introduced by the rectification function.即使在我们的案例下,1×1卷积基本上是在相同维度空间上的线性投影(输入和输出通道的数量相同),由修正函数引入附加的非线性。
|
34 | incorporate (3) | [ɪnˈkɔ:pəreɪt] |
- First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative.首先,我们结合了三个非线性修正层,而不是单一的,这使得决策函数更具判别性。
- We envisage that better localisation performance can be achieved if this technique is incorporated into our method.我们设想,如果将这种技术结合到我们的方法中,可以获得更好的定位性能。
- Unlike other approaches, we did not incorporate any task-specific heuristics, but relied on the representation power of very deep convolutional features.与其他方法不同,我们没有包含任何特定于任务的启发式方法,而是依赖于非常深的卷积特征的表示能力。
|
35 | regularisation (3) | [,reɡjulərai'zeiʃən] |
- This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).这可以看作是对7×7卷积滤波器进行正则化,迫使它们通过3×3滤波器(在它们之间注入非线性)进行分解。
- The training was regularised by weight decay (the L2 penalty multiplier set to $5 \times 10^{−4}$) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5).训练通过权重衰减(L2惩罚乘子设定为$5\times 10^{−4}$)进行正则化,前两个全连接层执行丢弃正则化(丢弃率设定为0.5)。
- We conjecture that in spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al. , 2012), the nets required less epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers.我们推测,尽管与(Krizhevsky等,2012)相比我们的网络参数更多、深度更大,但网络收敛所需的epoch更少,这是由于(a)更大的深度和更小的卷积滤波器尺寸带来的隐式正则化;(b)某些层的预初始化。
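The training hyper-parameters quoted above translate directly into a modern framework; a PyTorch sketch with a hypothetical placeholder model (the 1e-2 initial learning rate is the paper's stated starting value):

```python
import torch
import torch.nn as nn

# Placeholder for a VGG-style net: two FC layers with dropout 0.5 between.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(4096, 1000))
# Mini-batch SGD: momentum 0.9, weight decay (L2 multiplier) 5e-4.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
```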
|
36 | GoogLeNet (3) | |
- GoogLeNet (Szegedy et al. , 2014), a top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNets(22 weight layers) and small convolution filters (apart from 3 × 3, they also use 1 × 1 and 5 × 5 convolutions).GoogLeNet(Szegedy等,2014)是ILSVRC-2014分类任务中表现最好的参赛项目,它是独立于我们的工作开发的,但相似之处在于它也基于非常深的ConvNets(22个权重层)和小卷积滤波器(除了3×3,它们还使用了1×1和5×5卷积)。
- Our result is also competitive with respect to the classification task winner (GoogLeNet with 6.7% error) and substantially outperforms the ILSVRC-2013 winning submission Clarifai, which achieved 11.2% with outside training data and 11.7% without it.我们的结果对于分类任务获胜者(GoogLeNet具有6.7%的错误率)也具有竞争力,并且大大优于ILSVRC-2013获胜者Clarifai的提交,其使用外部训练数据取得了11.2%的错误率,没有外部数据则为11.7%。
- In terms of the single-net performance, our architecture achieves the best result (7.0% test error), outperforming a single GoogLeNet by 0.9%.在单网络性能方面,我们的架构取得了最好结果(7.0%测试误差),超过单个GoogLeNet 0.9%。
|
37 | LeCun (3) | |
- Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al. , 1989)) with momentum.也就是说,通过使用具有动量的小批量梯度下降(基于反向传播(LeCun等人,1989))优化多项式逻辑回归目标函数来进行训练。
- Notably, we did not depart from the classical ConvNet architecture of LeCun et al. (1989), but improved it by substantially increasing the depth.值得注意的是,我们并没有偏离LeCun等人(1989)的经典ConvNet架构,而是通过大幅增加深度对其进行了改进。
- It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al. , 1989; Krizhevsky et al. , 2012) with substantially increased depth.已经证明,表示深度有利于分类精度,并且深度大大增加的传统ConvNet架构(LeCun等,1989;Krizhevsky等,2012)可以实现ImageNet挑战数据集上的最佳性能。
|
38 | augmentation (3) | [ˌɔ:ɡmen'teɪʃn] |
- This can also be seen as training set augmentation by scale jittering, where a single model is trained to recognise objects over a wide range of scales.这也可以看作是通过尺度抖动进行训练集增强,其中单个模型被训练在一定尺度范围内识别对象。
- This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.这证实了通过尺度抖动进行的训练集增强确实有助于捕获多尺度图像统计。
- v2 Adds post-submission ILSVRC experiments with training set augmentation using scale jittering, which improves the performance.v2增加了提交后的ILSVRC实验,使用尺度抖动进行训练集增强,从而提高了性能。
|
39 | posterior (3) | [pɒˈstɪəriə(r)] |
- We also augment the test set by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores for the image.我们还通过水平翻转图像来增强测试集;将原始图像和翻转图像的soft-max类后验进行平均,以获得图像的最终分数。
- It consists of running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors.它包括在一张测试图像的几个归一化版本上运行模型(对应于不同的Q值),然后对所得到的类别后验进行平均。
- In this part of the experiments, we combine the outputs of several models by averaging their soft-max class posteriors.在这部分实验中,我们通过对soft-max类别后验进行平均,结合了几种模型的输出。
|
40 | aggregate (3) | [ˈægrɪgət] |
- To utilise the ConvNets, pre-trained on ILSVRC, for image classification on other datasets, we remove the last fully-connected layer (which performs 1000-way ILSVRC classification), and use 4096-D activations of the penultimate layer as image features, which are aggregated across multiple locations and scales.为了将在ILSVRC上预训练的ConvNets用于其他数据集的图像分类,我们移除了最后一个全连接层(它执行1000类ILSVRC分类),并使用倒数第二层的4096维激活作为图像特征,这些特征在多个位置和尺度上进行聚合。
- Notably, by examining the performance on the validation sets of VOC-2007 and VOC-2012, we found that aggregating image descriptors, computed at multiple scales, by averaging performs similarly to the aggregation by stacking.值得注意的是,通过检查VOC-2007和VOC-2012验证集上的性能,我们发现对在多个尺度上计算的图像描述符进行平均聚合,其性能与通过堆叠进行聚合相近。
- Since averaging has a benefit of not inflating the descriptor dimensionality, we were able to aggregate image descriptors over a wide range of scales: $Q \in \{256, 384, 512, 640, 768\}$.由于平均具有不膨胀描述符维度的优点,我们能够在很宽的尺度范围$Q \in \{256,384,512,640,768\}$上聚合图像描述符。
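The averaging-versus-stacking choice above affects only the output dimensionality; a numpy sketch over per-scale 4096-D descriptors:

```python
import numpy as np

def aggregate_descriptors(per_scale, method="average"):
    """Combine descriptors computed at several scales Q: averaging keeps
    4096 dimensions, stacking concatenates to num_scales * 4096."""
    per_scale = np.asarray(per_scale)      # shape: (num_scales, 4096)
    if method == "average":
        return per_scale.mean(axis=0)      # (4096,)
    return per_scale.reshape(-1)           # (num_scales * 4096,)
```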
|
41 | thorough (2) | [ˈθʌrə] |
- Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3 × 3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.我们的主要贡献是使用非常小的(3×3)卷积滤波器架构对网络深度的增加进行了全面评估,这表明通过将深度推到16-19加权层可以实现对现有技术配置的显著改进。
- In the main body of the paper we have considered the classification task of the ILSVRC challenge, and performed a thorough evaluation of ConvNet architectures of different depth.在论文的主体部分,我们考虑了ILSVRC挑战的分类任务,并对不同深度的ConvNet架构进行了深入的评估。
|
42 | e.g. (2) | [ˌi: ˈdʒi:] |
- As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as a part of a relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning).因此,我们提出了更为精确的ConvNet架构,不仅可以在ILSVRC分类和定位任务上取得的最佳的准确性,而且还适用于其它的图像识别数据集,它们可以获得优异的性能,即使使用相对简单流程的一部分(例如,通过线性SVM分类深度特征而不进行微调)。
- Rather than using relatively large receptive fields in the first conv. layers (e.g. 11 × 11 with stride 4 in (Krizhevsky et al. , 2012), or 7 × 7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al. , 2014)), we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1).不是在第一卷积层中使用相对较大的感受野(例如,在(Krizhevsky等人,2012)中的11×11,步长为4,或在(Zeiler&Fergus,2013;Sermanet等,2014)中的7×7,步长为2),我们在整个网络使用非常小的3×3感受野,与输入的每个像素(步长为1)进行卷积。
|
43 | SVM (2) | [!≈ es vi: em] |
- As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as a part of a relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning).因此,我们提出了更为精确的ConvNet架构,不仅可以在ILSVRC分类和定位任务上取得的最佳的准确性,而且还适用于其它的图像识别数据集,它们可以获得优异的性能,即使使用相对简单流程的一部分(例如,通过线性SVM分类深度特征而不进行微调)。
- The resulting image descriptor is L2-normalised and combined with a linear SVM classifier, trained on the target dataset.得到的图像描述符经过L2归一化,并与在目标数据集上训练的线性SVM分类器结合使用。
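The classification pipeline above (L2-normalised deep features plus a linear SVM, no fine-tuning) is a few lines with scikit-learn; the random arrays stand in for real descriptors and labels:

```python
import numpy as np
from sklearn.svm import LinearSVC

X = np.random.randn(100, 4096)                 # placeholder descriptors
y = np.random.randint(0, 10, size=100)         # placeholder labels
X /= np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalise each row
clf = LinearSVC(C=1.0).fit(X, y)               # linear SVM, no fine-tuning
```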
|
44 | Ciresan (2) | |
- To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011); Krizhevsky et al. (2012).为了衡量ConvNet深度在公平环境中所带来的改进,我们所有的ConvNet层配置都使用相同的规则,灵感来自Ciresan等(2011);Krizhevsky等人(2012年)。
- Small-size convolution filters have been previously used by Ciresan et al. (2011), but their nets are significantly less deep than ours, and they did not evaluate on the large-scale ILSVRC dataset.Ciresan等人(2011)以前曾使用过小尺寸的卷积滤波器,但他们的网络远不如我们的深,并且他们没有在大规模的ILSVRC数据集上进行评估。
|
45 | generic (2) | [dʒəˈnerɪk] |
- In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2).在本节中,我们首先描述我们的ConvNet配置的通用设计(第2.1节),然后详细说明评估中使用的具体配置(第2.2节)。
- All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth: from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers).所有配置都遵循2.1节提出的通用设计,并且仅在深度上不同:从网络A中的11个权重层(8个卷积层和3个FC层)到网络E中的19个权重层(16个卷积层和3个FC层)。
|
46 | LRN (2) | [!≈ el ɑ:(r) en] |
- We note that none of our networks (except for one) contain Local Response Normalisation (LRN) normalisation (Krizhevsky et al. , 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.我们注意到,我们的网络(除了一个)都不包含局部响应规范化(LRN)(Krizhevsky等,2012):将在第4节看到,这种规范化并不能提高在ILSVRC数据集上的性能,但增加了内存消耗和计算时间。
- Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al. , 2012).在适用的情况下,LRN层的参数采用(Krizhevsky等,2012)中的参数。
|
47 | brevity (2) | [ˈbrevəti] |
- The ReLU activation function is not shown for brevity.为了简洁起见,不显示ReLU激活功能。
- In these experiments, the smallest image side was set to S = 384; the results with S = 256 exhibit the same behaviour and are not shown for brevity.在这些实验中,最小图像边被设置为S=384;S=256的结果表现出相同的行为,为简洁起见不再显示。
|
48 | top-performing (2) | [!≈ tɒp pə'fɔ:mɪŋ] |
- Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al. , 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al. , 2014).我们的ConvNet配置与ILSVRC-2012(Krizhevsky等,2012)和ILSVRC-2013比赛(Zeiler&Fergus,2013;Sermanet等,2014)表现最佳的参赛提交中使用的ConvNet配置有很大不同。
- GoogLeNet (Szegedy et al. , 2014), a top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNets(22 weight layers) and small convolution filters (apart from 3 × 3, they also use 1 × 1 and 5 × 5 convolutions).GoogLeNet(Szegedy等,2014)是ILSVRC-2014分类任务中表现最好的参赛项目,它是独立于我们的工作开发的,但相似之处在于它也基于非常深的ConvNets(22个权重层)和小卷积滤波器(除了3×3,它们还使用了1×1和5×5卷积)。
|
49 | convolved (2) | [kənˈvɔlvd] |
- Rather than using relatively large receptive fields in the first conv. layers (e.g. 11 × 11 with stride 4 in (Krizhevsky et al. , 2012), or 7 × 7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al. , 2014)), we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1).不是在第一卷积层中使用相对较大的感受野(例如,在(Krizhevsky等人,2012)中的11×11,步长为4,或在(Zeiler&Fergus,2013;Sermanet等,2014)中的7×7,步长为2),我们在整个网络使用非常小的3×3感受野,与输入的每个像素(步长为1)进行卷积。
- Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured.此外,由于不同的卷积边界条件,多裁剪图像评估是密集评估的补充:当将ConvNet应用于裁剪图像时,卷积特征图用零填充,而在密集评估的情况下,相同裁剪图像的填充自然会来自于图像的相邻部分(由于卷积和空间池化),这大大增加了整个网络的感受野,因此捕获了更多的上下文。
|
50 | momentum (2) | [məˈmentəm] |
- Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al. , 1989)) with momentum.也就是说,通过使用具有动量的小批量梯度下降(基于反向传播(LeCun等人,1989))优化多项式逻辑回归目标函数来进行训练。
- The batch size was set to 256, momentum to 0.9.批量大小设为256,动量为0.9。
|
51 | epoch (2) | [ˈi:pɒk] |
- In total, the learning rate was decreased 3 times, and the learning was stopped after 370K iterations (74 epochs).学习率总共降低3次,学习在37万次迭代后停止(74个epochs)。
- We conjecture that in spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al. , 2012), the nets required less epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers.我们推测,尽管与(Krizhevsky等,2012)相比我们的网络参数更多、深度更大,但网络收敛所需的epoch更少,这是由于(a)更大的深度和更小的卷积滤波器尺寸带来的隐式正则化;(b)某些层的预初始化。
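The iteration/epoch figures quoted above are mutually consistent; a quick check assuming the commonly cited ILSVRC-2012 training-set size of 1,281,167 images:

```python
iters, batch, train_images = 370_000, 256, 1_281_167
print(iters * batch / train_images)   # ~73.9, i.e. the quoted 74 epochs
```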
|
52 | augment (2) | [ɔ:gˈment] |
- To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift (Krizhevsky et al. , 2012).为了进一步增强训练集,裁剪图像经过了随机水平翻转和随机RGB颜色偏移(Krizhevsky等,2012)。
- We also augment the test set by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores for the image.我们还通过水平翻转图像来增强测试集;将原始图像和翻转图像的soft-max类后验进行平均,以获得图像的最终分数。
|
53 | uncropped (2) | [ʌn'krɒpt] |
- The resulting fully-convolutional net is then applied to the whole (uncropped) image.然后将所得到的全卷积网络应用于整个(未裁剪的)图像上。
- Our implementation is derived from the publicly available C++ Caffe toolbox (Jia, 2013) (branched out in December 2013), but contains a number of significant modifications, allowing us to perform training and evaluation on multiple GPUs installed in a single system, as well as train and evaluate on full-size (uncropped) images at multiple scales (as described above).我们的实现源自公开的C++ Caffe工具箱(Jia,2013)(于2013年12月从其分支而来),但包含了一些重大修改,使我们能够在安装于单个系统中的多个GPU上进行训练和评估,也能在多个尺度上(如上所述)对全尺寸(未裁剪)图像进行训练和评估。
|
54 | spatially (2) | ['speɪʃəlɪ] |
- Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled).最后,为了获得图像的固定大小的类别分数向量,对类别得分图进行空间平均(求和池化)。
- To come up with the final prediction, we utilise the greedy merging procedure of Sermanet et al. (2014), which first merges spatially close predictions (by averaging their coordinates), and then rates them based on the class scores, obtained from the classification ConvNet.为了得到最终的预测,我们利用Sermanet等人(2014)的贪婪合并过程,它首先合并空间上接近的预测(通过平均它们的坐标),然后基于从分类ConvNet获得的类别得分对它们进行评级。
|
55 | complementary (2) | [ˌkɒmplɪˈmentri] |
- Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured.此外,由于不同的卷积边界条件,多裁剪图像评估是密集评估的补充:当将ConvNet应用于裁剪图像时,卷积特征图用零填充,而在密集评估的情况下,相同裁剪图像的填充自然会来自于图像的相邻部分(由于卷积和空间池化),这大大增加了整个网络的感受野,因此捕获了更多的上下文。
- As can be seen, using multiple crops performs slightly better than dense evaluation, and the two approaches are indeed complementary, as their combination outperforms each of them.可以看出,使用多裁剪图像表现比密集评估略好,而且这两种方法确实是互补的,因为它们的组合优于其中的每一种。
|
56 | parallelism (2) | [ˈpærəlelɪzəm] |
- Multi-GPU training exploits data parallelism, and is carried out by splitting each batch of training images into several GPU batches, processed in parallel on each GPU.多GPU训练利用数据并行性,通过将每批训练图像分成几个GPU批次,每个GPU并行处理。
- While more sophisticated methods of speeding up ConvNet training have been recently proposed (Krizhevsky, 2014), which employ model and data parallelism for different layers of the net, we have found that our conceptually much simpler scheme already provides a speedup of 3.75 times on an off-the-shelf 4-GPU system, as compared to using a single GPU.最近提出了更加复杂的加速ConvNet训练的方法(Krizhevsky,2014),它们对网络的不同层之间采用模型和数据并行,我们发现我们概念上更简单的方案与使用单个GPU相比,在现有的4-GPU系统上已经提供了3.75倍的加速。
|
57 | NVIDIA (2) | [ɪn'vɪdɪə] |
- On a system equipped with four NVIDIA Titan Black GPUs, training a single net took 2–3 weeks depending on the architecture.在配备四个NVIDIA Titan Black GPU的系统上,根据架构训练单个网络需要2-3周时间。
- This work was supported by ERC grant VisRec no. 228180. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.这项工作得到了ERC资助VisRec(编号228180)的支持。我们非常感谢NVIDIA公司为本研究捐赠所用的GPU。
|
58 | complementarity (2) | [ˌkɒmplɪmen'tærɪtɪ] |
- We also assess the complementarity of the two evaluation techniques by averaging their soft-max outputs.我们还通过平均其soft-max输出来评估两种评估技术的互补性。
- This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al. , 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al. , 2014).由于模型的互补性,这提高了性能,并且在2012年(Krizhevsky等,2012)和2013年(Zeiler&Fergus,2013;Sermanet等,2014)的ILSVRC顶级提交中被使用。
|
59 | hypothesize (2) | [haɪˈpɒθəsaɪz] |
- As noted above, we hypothesize that this is due to a different treatment of convolution boundary conditions.如上所述,我们假设这是由于卷积边界条件的不同处理。
- We hypothesize that this is due to the fact that in the VOC dataset the objects appear over a variety of scales, so there is no particular scale-specific semantics which a classifier could exploit.我们假设这是由于在VOC数据集中对象出现在各种尺度上的事实,因此没有分类器可以利用的特定尺度语义。
|
60 | scenario (2) | [səˈnɑ:riəʊ] |
- In this section we first determine the best-performing localisation setting (using the first test protocol), and then evaluate it in a fully-fledged scenario (the second protocol).在本节中,我们首先确定性能最佳的定位设置(使用第一个测试协议),然后在全面的场景(第二个协议)中对其进行评估。
- Having determined the best localisation setting (PCR, fine-tuning of all layers), we now apply it in the fully-fledged scenario, where the top-5 class labels are predicted using our best-performing classification system (Sect. 4.5), and multiple densely-computed bounding box predictions are merged using the method of Sermanet et al. (2014).在确定了最佳定位设置(PCR,微调所有层)之后,我们现在将其应用于全面的场景中:使用我们性能最佳的分类系统(第4.5节)预测前5个类别标签,并使用Sermanet等人(2014)的方法合并多个密集计算的边界框预测。
|
61 | aggregation (2) | [ˌæɡrɪ'ɡeɪʃn] |
- Aggregation of features is carried out in a similar manner to our ILSVRC evaluation procedure (Sect. 3.2).特征的聚合是以与我们的ILSVRC评估程序类似的方式进行的(Sect.3.2)。
- Notably, by examining the performance on the validation sets of VOC-2007 and VOC-2012, we found that aggregating image descriptors, computed at multiple scales, by averaging performs similarly to the aggregation by stacking.值得注意的是,通过检查VOC-2007和VOC-2012验证集上的性能,我们发现对在多个尺度上计算的图像描述符进行平均聚合,其性能与通过堆叠进行聚合相近。
|
62 | Everingham (2) | |
- We begin with the evaluation on the image classification task of PASCAL VOC-2007 and VOC-2012 benchmarks (Everingham et al. , 2015).我们首先评估Pascal VOC-2007和VOC-2012基准的图像分类任务(Everingham等,2015)。
- We also evaluated our best-performing image representation (the stacking of Net-D and Net-E features) on the PASCAL VOC-2012 action classification task (Everingham et al. , 2015), which consists in predicting an action class from a single image, given a bounding box of the person performing the action.我们还在Pascal VOC-2012动作分类任务(Everingham等人,2015)上评估了我们的最佳性能图像表示(Net-D和Net-E特征的叠加),该任务包括从单个图像预测动作类,给定执行者的边界框。
|
63 | scale-specific (2) | [!≈ skeɪl spəˈsɪfɪk] |
- We hypothesize that this is due to the fact that in the VOC dataset the objects appear over a variety of scales, so there is no particular scale-specific semantics which a classifier could exploit.我们假设这是由于在VOC数据集中对象出现在各种尺度上的事实,因此没有分类器可以利用的特定尺度语义。
- This can be explained by the fact that in Caltech images objects typically occupy the whole image, so multi-scale image features are semantically different (capturing the whole object vs. object parts), and stacking allows a classifier to exploit such scale-specific representations.这可以通过以下事实来解释:在Caltech图像中,对象通常占据整个图像,因此多尺度图像特征在语义上是不同的(捕获整个对象与对象的局部),而堆叠允许分类器利用这种尺度特定的表示。
|
64 | semantic (2) | [sɪˈmæntɪk] |
- We hypothesize that this is due to the fact that in the VOC dataset the objects appear over a variety of scales, so there is no particular scale-specific semantics which a classifier could exploit.我们假设这是由于在VOC数据集中对象出现在各种尺度上的事实,因此没有分类器可以利用的特定尺度语义。
- Similar gains over a more shallow architecture of Krizhevsky et al. (2012) have been observed in semantic segmentation (Long et al. , 2014), image caption generation (Kiros et al. , 2014; Karpathy & Fei-Fei, 2014), texture and material recognition (Cimpoi et al. , 2014; Bell et al. , 2014).相对于Krizhevsky等人(2012)更浅架构的类似收益,已经在语义分割(Long等人,2014)、图像字幕生成(Kiros等人,2014;Karpathy&Fei-Fei,2014)、纹理和材料识别(Cimpoi等人,2014;Bell等人,2014)中被观察到。
|
65 | semantically (2) | [sɪ'mæntɪklɪ] |
- It should be noted that the method of Wei et al. (2014), which achieves 1% better mAP on VOC-2012, is pre-trained on an extended 2000-class ILSVRC dataset, which includes additional 1000 categories, semantically close to those in VOC datasets.应该注意的是Wei等人(2014)的方法在VOC-2012上实现了1%的mAP改善,在扩展的2000类ILSVRC数据集上进行了预训练,该数据集包括另外1000个类别,在语义上接近于VOC数据集中的类别。
- This can be explained by the fact that in Caltech images objects typically occupy the whole image, so multi-scale image features are semantically different (capturing the whole object vs. object parts), and stacking allows a classifier to exploit such scale-specific representations.这可以通过以下事实来解释:在Caltech图像中,对象通常占据整个图像,因此多尺度图像特征在语义上是不同的(捕获整个对象与对象的局部),而堆叠允许分类器利用这种尺度特定的表示。
|
66 | Fei-Fei (2) | |
- In this section we evaluate very deep features on Caltech-101 (Fei-Fei et al. , 2004) and Caltech-256 (Griffin et al. , 2007) image classification benchmarks.在本节中,我们在Caltech-101(Fei-Fei等人,2004)和Caltech-256(Griffin等人,2007)图像分类基准上评估非常深的特征。
- Similar gains over a more shallow architecture of Krizhevsky et al. (2012) have been observed in semantic segmentation (Long et al. , 2014), image caption generation (Kiros et al. , 2014; Karpathy & Fei-Fei, 2014), texture and material recognition (Cimpoi et al. , 2014; Bell et al. , 2014).相对于Krizhevsky等人(2012)更浅架构的类似收益,已经在语义分割(Long等人,2014)、图像字幕生成(Kiros等人,2014;Karpathy&Fei-Fei,2014)、纹理和材料识别(Cimpoi等人,2014;Bell等人,2014)中被观察到。
|
67 | ICLR (2) | [!≈ aɪ si: el ɑ:(r)] |
- v4 The paper is converted to ICLR-2015 submission format.v4将论文转换为ICLR-2015提交格式。
- v6 Camera-ready ICLR-2015 conference paper.v6 定稿(camera-ready)的ICLR-2015会议论文。
|
68 | prior-art (1) | [!≈ ˈpraɪə(r) ɑ:t] |
- Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3 × 3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.我们的主要贡献是使用非常小的(3×3)卷积滤波器架构对网络深度的增加进行了全面评估,这表明通过将深度推到16-19加权层可以实现对现有技术配置的显著改进。
|
69 | commodity (1) | [kəˈmɒdəti] |
- With ConvNets becoming more of a commodity in the computer vision field, a number of attempts have been made to improve the original architecture of Krizhevsky et al. (2012) in a bid to achieve better accuracy.随着ConvNets在计算机视觉领域越来越商品化,为了达到更好的准确性,已经进行了许多尝试来改进Krizhevsky等人(2012)最初的架构。
|
70 | Howard (1) | [ˈhauəd] |
- Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales (Sermanet et al. , 2014; Howard, 2014).另一条改进路线是在整个图像和多个尺度上对网络进行密集的训练和测试(Sermanet等,2014;Howard,2014)。
|
71 | completeness (1) | [kəm'pli:tnəs] |
- For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B.为了完整起见,我们还将在附录A中描述和评估我们的ILSVRC-2014目标定位系统,并在附录B中讨论了非常深的特征在其它数据集上的泛化。
|
72 | A–E (1) | [!≈ ə i:] |
- In the following we will refer to the nets by their names (A–E).在下文中,我们将按网络名称(A–E)来指代这些网络。
|
73 | discriminative (1) | [dɪs'krɪmɪnətɪv] |
- First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative.首先,我们结合了三个非线性修正层,而不是单一的,这使得决策函数更具判别性。
|
74 | parametrised (1) | |
- Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack has C channels, the stack is parametrised by $3(3^2C^2)=27C^2$ weights; at the same time, a single 7 × 7 conv. layer would require $7^2C^2=49C^2$ parameters, i.e. 81% more.其次,我们减少参数的数量:假设三层3×3卷积堆叠的输入和输出有C个通道,堆叠卷积层的参数为$3(3^2C^2)=27C^2$个权重;同时,单个7×7卷积层将需要$7^2C^2=49C^2$个参数,即参数多81%。
|
75 | decomposition (1) | [ˌdi:kɒmpə'zɪʃn] |
- This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).这可以看作是对7×7卷积滤波器进行正则化,迫使它们通过3×3滤波器(在它们之间注入非线性)进行分解。
|
76 | incorporation (1) | [ɪnˌkɔ:pə'reɪʃn] |
- The incorporation of 1 × 1 conv. layers (configuration C, Table 1) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers.结合1×1卷积层(配置C,表1)是在不影响卷积层感受野的情况下增加决策函数非线性的一种方式。
|
77 | Goodfellow (1) | |
- Goodfellow et al. (2014) applied deep ConvNets (11 weight layers) to the task of street number recognition, and showed that the increased depth led to better performance.Goodfellow等人(2014)在街道号识别任务中采用深层ConvNets(11个权重层),显示出增加的深度导致了更好的性能。
|
78 | topology (1) | [tə'pɒlədʒɪ] |
- Their network topology is, however, more complex than ours, and the spatial resolution of the feature maps is reduced more aggressively in the first layers to decrease the amount of computation.然而,它们的网络拓扑结构比我们的更复杂,并且在第一层中特征图的空间分辨率被更积极地减少,以减少计算量。
|
79 | optimise (1) | ['ɒptɪmaɪz] |
- Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al. , 1989)) with momentum.也就是说,通过使用具有动量的小批量梯度下降(基于反向传播(LeCun等人,1989))优化多项式逻辑回归目标函数来进行训练。
|
80 | multinomial (1) | [ˌmʌltɪ'nəʊmɪəl] |
- Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al. , 1989)) with momentum.也就是说,通过使用具有动量的小批量梯度下降(基于反向传播(LeCun等人,1989))优化多项式逻辑回归目标函数来进行训练。
|
81 | descent (1) | [dɪˈsent] |
- Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al. , 1989)) with momentum.也就是说,通过使用具有动量的小批量梯度下降(基于反向传播(LeCun等人,1989))优化多项式逻辑回归目标函数来进行训练。
|
82 | regularise (1) | ['regjʊləraɪz] |
- The training was regularised by weight decay (the L2 penalty multiplier set to $5 \times 10^{−4}$) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5).训练通过权重衰减(L2惩罚乘子设定为$5\times 10^{−4}$)进行正则化,前两个全连接层执行丢弃正则化(丢弃率设定为0.5)。
|
83 | multiplier (1) | [ˈmʌltɪplaɪə(r)] |
- The training was regularised by weight decay (the L2 penalty multiplier set to $5 \times 10^{−4}$) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5).训练通过权重衰减(L2惩罚乘子设定为$5\times 10^{−4}$)进行正则化,前两个全连接层执行丢弃正则化(丢弃率设定为0.5)。
|
84 | conjecture (1) | [kənˈdʒektʃə(r)] |
- We conjecture that in spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al. , 2012), the nets required less epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers.我们推测,尽管与(Krizhevsky等,2012)相比我们的网络参数更多、深度更大,但网络收敛所需的epoch更少,这是由于(a)更大的深度和更小的卷积滤波器尺寸带来的隐式正则化;(b)某些层的预初始化。
|
85 | stall (1) | [stɔ:l] |
- The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets.网络权重的初始化是重要的,因为由于深度网络中梯度的不稳定,不好的初始化可能会阻碍学习。
|
86 | instability (1) | [ˌɪnstəˈbɪləti] |
- The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets.网络权重的初始化是重要的,因为由于深度网络中梯度的不稳定,不好的初始化可能会阻碍学习。
|
87 | circumvent (1) | [ˌsɜ:kəmˈvent] |
- To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation.为了规避这个问题,我们从训练配置A(表1)开始,它足够浅,可以用随机初始化进行训练。
|
88 | variance (1) | [ˈveəriəns] |
- For random initialisation (where applicable), we sampled the weights from a normal distribution with the zero mean and $10^{−2}$ variance.对于随机初始化(如适用),我们从均值为0、方差为$10^{−2}$的正态分布中采样权重。
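The random initialisation quoted above samples from a zero-mean normal with variance $10^{-2}$; note that numpy's `scale` argument is a standard deviation, hence the square root in this sketch (the paper also sets biases to zero):

```python
import numpy as np

def init_weights(shape, variance=1e-2):
    """Zero-mean normal initialisation with the stated 1e-2 variance."""
    return np.random.normal(loc=0.0, scale=np.sqrt(variance), size=shape)

w = init_weights((64, 3, 3, 3))   # e.g. a first-layer 3x3 conv over RGB
```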
|
89 | Glorot (1) | |
- It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).值得注意的是,在提交论文之后,我们发现可以通过使用Glorot & Bengio(2010)的随机初始化程序来初始化权重而不进行预训练。
|
90 | Bengio (1) | |
- It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).值得注意的是,在提交论文之后,我们发现可以通过使用Glorot & Bengio(2010)的随机初始化程序来初始化权重而不进行预训练。
|
91 | SGD (1) | ['esdʒ'i:d'i:] |
- To obtain the fixed-size 224×224 ConvNet input images, they were randomly cropped from rescaled training images (one crop per image per SGD iteration).为了获得固定大小的224×224 ConvNet输入图像,它们是从重新缩放的训练图像中随机裁剪的(每个图像每次SGD迭代进行一次裁剪)。
|
92 | isotropically-rescaled (1) | |
- Let S be the smallest side of an isotropically-rescaled training image, from which the ConvNet input is cropped (we also refer to S as the training scale).令S为等比例缩放的训练图像的最小边,ConvNet输入从该图像中裁剪(我们也将S称为训练尺度)。
|
93 | whole-image (1) | [!≈ həʊl ˈɪmɪdʒ] |
- While the crop size is fixed to 224 × 224, in principle S can take on any value not less than 224: for S = 224 the crop will capture whole-image statistics, completely spanning the smallest side of a training image; for S ≫ 224 the crop will correspond to a small part of the image, containing a small object or an object part.虽然裁剪尺寸固定为224×224,但原则上S可以取不小于224的任何值:当S=224时,裁剪将捕获整幅图像的统计信息,完全覆盖训练图像的最小边;当S≫224时,裁剪将对应于图像的一小部分,包含一个小对象或对象的一部分。
|
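The two sentences above define the training-time preprocessing: isotropic rescaling so the smallest side equals the training scale S, then one random 224×224 crop per image per SGD iteration. A sketch assuming torchvision (the original pipeline was implemented in a modified Caffe):

```python
from torchvision import transforms

S = 384  # training scale; the paper uses S in {256, 384} or jittered S in [256, 512]
train_transform = transforms.Compose([
    transforms.Resize(S),              # smallest side -> S, aspect ratio preserved
    transforms.RandomCrop(224),        # fixed-size ConvNet input
    transforms.RandomHorizontalFlip(), # augmentation also used in the paper
    transforms.ToTensor(),
])
```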
94 | isotropically (1) | |
- First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q (we also refer to it as the test scale).首先,将其等比例缩放到预定义的最小图像边,表示为Q(我们也将其称为测试尺度)。
|
95 | sum-pooled (1) | [!≈ sʌm 'pu:ld] |
- Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled).最后,为了获得图像的类别分数的固定大小的向量,类得分图在空间上平均(和池化)。
|
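Concretely, when the net is applied fully-convolutionally to an uncropped image, the class score map has a spatial extent that depends on the image size; averaging over that extent yields the fixed-size score vector. A minimal sketch (shapes hypothetical):

```python
import torch

score_map = torch.randn(1, 1000, 8, 10)    # class score map: (N, classes, H, W)
class_scores = score_map.mean(dim=(2, 3))  # spatial average (sum-pooling up to a constant)
print(class_scores.shape)                  # torch.Size([1, 1000])
```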
96 | comparable (1) | [ˈkɒmpərəbl] |
- While we believe that in practice the increased computation time of multiple crops does not justify the potential gains in accuracy, for reference we also evaluate our networks using 50 crops per scale (5 × 5 regular grid with 2 flips), for a total of 150 crops over 3 scales, which is comparable to 144 crops over 4 scales used by Szegedy et al. (2014).虽然我们认为在实践中,多裁剪图像增加的计算时间并不足以抵消准确性的潜在收益,但作为参考,我们还在每个尺度使用50个裁剪图像(5×5规则网格,2次翻转)评估了我们的网络,在3个尺度上总共150个裁剪图像,与Szegedy等人(2014)在4个尺度上使用的144个裁剪图像相当。
|
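The crop arithmetic in the sentence above is 5 × 5 grid positions × 2 flips = 50 crops per scale, hence 150 over 3 scales. A sketch of generating such a grid (function name hypothetical):

```python
import torch

def grid_crops(img, crop=224, grid=5):
    # img: (C, H, W) tensor with H, W >= crop
    _, H, W = img.shape
    ys = torch.linspace(0, H - crop, grid).long().tolist()
    xs = torch.linspace(0, W - crop, grid).long().tolist()
    crops = []
    for y in ys:
        for x in xs:
            c = img[:, y:y + crop, x:x + crop]
            crops.append(c)
            crops.append(torch.flip(c, dims=[2]))  # horizontal flip
    return torch.stack(crops)                      # (grid*grid*2, C, crop, crop)
```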
97 | Caffe (1) | |
- Our implementation is derived from the publicly available C++ Caffe toolbox (Jia, 2013) (branched out in December 2013), but contains a number of significant modifications, allowing us to perform training and evaluation on multiple GPUs installed in a single system, as well as train and evaluate on full-size (uncropped) images at multiple scales (as described above).我们的实现源自公开的C++ Caffe工具箱(Jia,2013)(于2013年12月从其分支而来),但包含了一些重大修改,使我们能够在安装于单个系统中的多个GPU上进行训练和评估,也能在多个尺度上(如上所述)对全尺寸(未裁剪)图像进行训练和评估。
|
98 | synchronous (1) | [ˈsɪŋkrənəs] |
- Gradient computation is synchronous across the GPUs, so the result is exactly the same as when training on a single GPU.梯度计算在GPU之间是同步的,所以结果与在单个GPU上训练完全一样。
|
99 | conceptually (1) | [kən'septʃʊəlɪ] |
- While more sophisticated methods of speeding up ConvNet training have been recently proposed (Krizhevsky, 2014), which employ model and data parallelism for different layers of the net, we have found that our conceptually much simpler scheme already provides a speedup of 3.75 times on an off-the-shelf 4-GPU system, as compared to using a single GPU.最近提出了更复杂的加速ConvNet训练的方法(Krizhevsky,2014),它们对网络的不同层采用模型并行和数据并行;我们发现,与使用单个GPU相比,我们这种概念上简单得多的方案在现成的4-GPU系统上已经提供了3.75倍的加速。
|
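The scheme described above is plain synchronous data parallelism: each GPU processes a slice of every mini-batch and the gradients are combined before the update, so the result matches single-GPU training. In PyTorch (assumed here as a stand-in) the closest built-in is nn.DataParallel:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())  # stand-in net
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # splits each mini-batch across all GPUs
if torch.cuda.is_available():
    model = model.cuda()
# forward/backward proceed as usual; per-GPU gradients are summed on the
# default device, keeping the update identical to single-GPU training
```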
100 | Titan (1) | [ˈtaɪtn] |
- On a system equipped with four NVIDIA Titan Black GPUs, training a single net took 2–3 weeks depending on the architecture.在配备四个NVIDIA Titan Black GPU的系统上,根据架构训练单个网络需要2-3周时间。
|
101 | held-out (1) | [!≈ held aʊt] |
- The dataset includes images of 1000 classes, and is split into three sets: training (1.3M images), validation (50K images), and testing (100K images with held-out class labels).数据集包含1000个类别的图像,并分为三组:训练集(130万张图像)、验证集(5万张图像)和测试集(类标签未公开的10万张图像)。
|
102 | incorrectly (1) | [ˌɪnkə'rektlɪ] |
- The former is a multi-class classification error, i.e. the proportion of incorrectly classified images; the latter is the main evaluation criterion used in ILSVRC, and is computed as the proportion of images such that the ground-truth category is outside the top-5 predicted categories.前者是多类分类误差,即被错误分类的图像的比例;后者是ILSVRC中使用的主要评估标准,计算为真实类别在前5个预测类别之外的图像的比例。
|
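A sketch of both metrics as defined above (top-1: highest-scoring prediction wrong; top-5: ground truth outside the five highest-scoring predictions), PyTorch assumed:

```python
import torch

def topk_error(scores, targets, k):
    # scores: (N, num_classes), targets: (N,) ground-truth class indices
    topk = scores.topk(k, dim=1).indices                 # (N, k) predicted classes
    correct = (topk == targets.unsqueeze(1)).any(dim=1)  # hit anywhere in top-k?
    return 1.0 - correct.float().mean().item()

scores = torch.randn(4, 1000)
targets = torch.randint(0, 1000, (4,))
top1, top5 = topk_error(scores, targets, 1), topk_error(scores, targets, 5)
```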
103 | jitter (1) | ['dʒɪtə] |
- The test image size was set as follows: Q = S for fixed S, and $Q = 0.5(S_{min} + S_{max})$ for jittered $S \in [S_{min}, S_{max}]$.测试图像大小设置如下:对于固定的S,Q = S;对于抖动的$S \in [S_{min}, S_{max}]$,$Q = 0.5(S_{min} + S_{max})$。
|
104 | A-LRN (1) | |
- First, we note that using local response normalisation (A-LRN network) does not improve on the model A without any normalisation layers.首先,我们注意到,使用局部响应归一化(A-LRN网络)并没有比不含任何归一化层的模型A带来改善。
|
105 | B–E (1) | [!≈ bi: i:] |
- We thus do not employ normalisation in the deeper architectures (B–E).因此,我们在较深的架构(B-E)中不采用归一化。
|
106 | saturate (1) | [ˈsætʃəreɪt] |
- The error rate of our architecture saturates when the depth reaches 19 layers, but even deeper models might be beneficial for larger datasets.当深度达到19层时,我们架构的错误率饱和,但更深的模型可能有益于较大的数据集。
|
107 | discrepancy (1) | [dɪsˈkrepənsi] |
- Considering that a large discrepancy between training and testing scales leads to a drop in performance, the models trained with fixed S were evaluated over three test image sizes, close to the training one: $Q = \{S − 32, S, S + 32\}$.考虑到训练和测试尺度之间的巨大差异会导致性能下降,用固定S训练的模型在接近训练尺度的三个测试图像尺寸上进行了评估:$Q = \{S − 32, S, S + 32\}$。
|
108 | multi-crop (1) |
- In Table 5 we compare dense ConvNet evaluation with multi-crop evaluation (see Sect. 3.2 for details).在表5中,我们将稠密ConvNet评估与多裁剪图像评估进行比较(细节参见第3.2节)。
|
109 | Clarifai (1) | |
- Our result is also competitive with respect to the classification task winner (GoogLeNet with 6.7% error) and substantially outperforms the ILSVRC-2013 winning submission Clarifai, which achieved 11.2% with outside training data and 11.7% without it.我们的结果对于分类任务获胜者(GoogLeNet具有6.7%的错误率)也具有竞争力,并且大大优于ILSVRC-2013获胜者Clarifai的提交,其使用外部训练数据取得了11.2%的错误率,没有外部数据则为11.7%。
|
110 | ERC (1) | [!≈ i: ɑ:(r) si:] |
- This work was supported by ERC grant VisRec no. 228180. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.这项工作得到ERC资助VisRec(编号228180)的支持。我们非常感谢NVIDIA公司为本研究捐赠所用的GPU。
|
111 | VisRec (1) | |
- This work was supported by ERC grant VisRec no. 228180. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.这项工作得到ERC资助VisRec(编号228180)的支持。我们非常感谢NVIDIA公司为本研究捐赠所用的GPU。
|
112 | gratefully (1) | ['ɡreɪtfəlɪ] |
- This work was supported by ERC grant VisRec no. 228180. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.这项工作得到ERC资助VisRec(编号228180)的支持。我们非常感谢NVIDIA公司为本研究捐赠所用的GPU。
|
113 | irrespective (1) | [ˌɪrɪ'spektɪv] |
- It can be seen as a special case of object detection, where a single object bounding box should be predicted for each of the top-5 classes, irrespective of the actual number of objects of the class.它可以被看作是对象检测的一种特殊情况,其中应该为前5个类中的每一个预测单个对象边界框,而不考虑该类的实际对象数量。
|
114 | few modifications (1) |
- For this we adopt the approach of Sermanet et al. (2014), the winners of the ILSVRC-2013 localisation challenge, with a few modifications.为此,我们采用Sermanet等人(2014,ILSVRC-2013定位挑战赛的获胜者)的方法,仅作了几处修改。
|
115 | class-specific (1) | [!≈ klɑ:s spəˈsɪfɪk] |
- There is a choice of whether the bounding box prediction is shared across all classes (single-class regression, SCR (Sermanet et al. , 2014)) or is class-specific (per-class regression, PCR).可以选择边界框预测是跨所有类别共享(单个类别回归,SCR(Sermanet et al.,2014))或是特定类别(逐个类别回归,PCR)。
|
116 | Euclidean (1) | [ju:ˈklidiən] |
- The main difference is that we replace the logistic regression objective with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground-truth.主要的区别是我们用欧几里得损失代替逻辑回归目标,这惩罚了预测的边界框参数与实际值的偏差。
|
117 | penalise (1) | ['pi:nəlaɪz] |
- The main difference is that we replace the logistic regression objective with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground-truth.主要的区别是我们用欧几里得损失代替逻辑回归目标,这惩罚了预测的边界框参数与实际值的偏差。
|
118 | deviation (1) | [ˌdi:viˈeɪʃn] |
- The main difference is that we replace the logistic regression objective with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground-truth.主要的区别是我们用欧几里得损失代替逻辑回归目标,这惩罚了预测的边界框参数与实际值的偏差。
|
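The entries above pin down the localisation head: SCR shares one 4-D box across classes, PCR predicts a 4-D box per class (4000-D for 1000 classes), and training minimises a Euclidean (squared) loss against the ground-truth box parameters. A hedged sketch (names and shapes illustrative, PyTorch assumed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 1000
scr_head = nn.Linear(4096, 4)                # single-class regression: one shared box
pcr_head = nn.Linear(4096, 4 * num_classes)  # per-class regression: one box per class

features = torch.randn(8, 4096)              # hypothetical batch of image features
target_boxes = torch.randn(8, 4)             # ground-truth box parameters
labels = torch.randint(0, num_classes, (8,))

pred = pcr_head(features).view(-1, num_classes, 4)
pred_for_gt = pred[torch.arange(8), labels]  # box predicted for the true class
loss = F.mse_loss(pred_for_gt, target_boxes) # Euclidean loss on box parameters
```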
119 | intersection (1) | [ˌɪntəˈsekʃn] |
- The localisation error is measured according to the ILSVRC criterion (Russakovsky et al., 2014), i.e. the bounding box prediction is deemed correct if its intersection over union ratio with the ground-truth bounding box is above 0.5.定位误差根据ILSVRC标准(Russakovsky等人,2014)测量,即如果预测边界框与真实边界框的交并比(IoU)超过0.5,则认为该预测是正确的。
|
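A minimal sketch of the criterion: intersection area over union area, thresholded at 0.5.

```python
def iou(box_a, box_b):
    # boxes as (x1, y1, x2, y2) corner coordinates
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

correct = iou((10, 10, 110, 110), (20, 20, 120, 120)) > 0.5  # IoU ~= 0.68 -> True
```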
120 | class-agnostic (1) | [!≈ klɑ:s ægˈnɒstɪk] |
- Settings comparison. As can be seen from Table 8, per-class regression (PCR) outperforms the class-agnostic single-class regression (SCR), which differs from the findings of Sermanet et al. (2014), where PCR was outperformed by SCR.设置比较。从表8可以看出,逐类回归(PCR)优于类别无关的单类回归(SCR),这与Sermanet等人(2014)的发现不同,在他们的实验中PCR的表现不如SCR。
|
121 | noticeably (1) | ['nəʊtɪsəblɪ] |
- We also note that fine-tuning all layers for the localisation task leads to noticeably better results than fine-tuning only the fully-connected layers (as done in (Sermanet et al., 2014)).我们还注意到,为定位任务微调所有层比仅微调全连接层(如(Sermanet et al., 2014)中所做的那样)得到明显更好的结果。
|
122 | densely-computed (1) | [!≈ denslɪ kəmˈpju:tid] |
- Having determined the best localisation setting (PCR, fine-tuning of all layers), we now apply it in the fully-fledged scenario, where the top-5 class labels are predicted using our best-performing classification system (Sect. 4.5), and multiple densely-computed bounding box predictions are merged using the method of Sermanet et al. (2014).在确定了最佳定位设置(PCR,微调所有层)之后,我们现在将其应用于完整场景:使用我们性能最佳的分类系统(第4.5节)预测前5个类别标签,并使用Sermanet等人(2014)的方法合并多个密集计算的边界框预测。
|
123 | Overfeat (1) | |
- Notably, our results are considerably better than those of the ILSVRC-2013 winner Overfeat (Sermanet et al., 2014), even though we used less scales and did not employ their resolution enhancement technique.值得注意的是,我们的结果大大优于ILSVRC-2013获胜者Overfeat(Sermanet等人,2014),尽管我们使用了更少的尺度,也没有采用他们的分辨率增强技术。
|
124 | envisage (1) | [ɪnˈvɪzɪdʒ] |
- We envisage that better localisation performance can be achieved if this technique is incorporated into our method.我们设想,如果将这种技术结合到我们的方法中,可以获得更好的定位性能。
|
125 | extractor (1) | [ɪkˈstræktə(r)] |
- In this section, we evaluate our ConvNets, pre-trained on ILSVRC, as feature extractors on other, smaller, datasets, where training large models from scratch is not feasible due to over-fitting.在本节中,我们评估在ILSVRC上预训练的ConvNets作为其他较小数据集上的特征提取器;在这些数据集上,由于过拟合,从头训练大型模型并不可行。
|
126 | over-fitting (1) | [!≈ ˈəʊvə(r) ˈfɪtɪŋ] |
- In this section, we evaluate our ConvNets, pre-trained on ILSVRC, as feature extractors on other, smaller, datasets, where training large models from scratch is not feasible due to over-fitting.在本节中,我们评估在ILSVRC上预训练的ConvNets作为其他较小数据集上的特征提取器;在这些数据集上,由于过拟合,从头训练大型模型并不可行。
|
127 | Donahue (1) | |
- Recently, there has been a lot of interest in such a use case (Zeiler & Fergus, 2013; Donahue et al., 2013; Razavian et al., 2014; Chatfield et al., 2014), as it turns out that deep image representations, learnt on ILSVRC, generalise well to other datasets, where they have outperformed hand-crafted representations by a large margin.最近,人们对这种用例产生了浓厚的兴趣(Zeiler & Fergus,2013;Donahue等人,2013;Razavian等人,2014;Chatfield等人,2014),因为事实证明,在ILSVRC上学习的深度图像表示可以很好地推广到其他数据集,并在这些数据集上大幅超越了手工设计的表示。
|
128 | Razavian (1) | |
- Recently, there has been a lot of interest in such a use case (Zeiler & Fergus, 2013; Donahue et al., 2013; Razavian et al., 2014; Chatfield et al., 2014), as it turns out that deep image representations, learnt on ILSVRC, generalise well to other datasets, where they have outperformed hand-crafted representations by a large margin.最近,人们对这种用例产生了浓厚的兴趣(Zeiler & Fergus,2013;Donahue等人,2013;Razavian等人,2014;Chatfield等人,2014),因为事实证明,在ILSVRC上学习的深度图像表示可以很好地推广到其他数据集,并在这些数据集上大幅超越了手工设计的表示。
|
129 | hand-crafted (1) | [,hænd 'kra:ftid] |
- Recently, there has been a lot of interest in such a use case (Zeiler & Fergus, 2013; Donahue et al., 2013; Razavian et al., 2014; Chatfield et al., 2014), as it turns out that deep image representations, learnt on ILSVRC, generalise well to other datasets, where they have outperformed hand-crafted representations by a large margin.最近,人们对这种用例产生了浓厚的兴趣(Zeiler & Fergus,2013;Donahue等人,2013;Razavian等人,2014;Chatfield等人,2014),因为事实证明,在ILSVRC上学习的深度图像表示可以很好地推广到其他数据集,并在这些数据集上大幅超越了手工设计的表示。
|
130 | state-of-the-art methods (1) |
- Following that line of work, we investigate if our models lead to better performance than more shallow models utilised in the state-of-the-art methods.遵循这一工作思路,我们研究我们的模型是否比最先进方法中使用的较浅模型具有更好的性能。
|
131 | penultimate (1) | [penˈʌltɪmət] |
- To utilise the ConvNets, pre-trained on ILSVRC, for image classification on other datasets, we remove the last fully-connected layer (which performs 1000-way ILSVRC classification), and use 4096-D activations of the penultimate layer as image features, which are aggregated across multiple locations and scales.为了将在ILSVRC上预训练的ConvNets用于其他数据集的图像分类,我们移除最后一个全连接层(它执行1000类ILSVRC分类),并使用倒数第二层的4096维激活作为图像特征,这些特征在多个位置和尺度上聚合。
|
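A hedged sketch of this recipe, using the torchvision VGG-16 weights as a stand-in for the released models: drop the final 1000-way layer and take the 4096-D penultimate activations as the descriptor, averaged with the descriptor of the horizontally flipped image (as the next entry notes):

```python
import torch
from torchvision import models

vgg = models.vgg16(weights="IMAGENET1K_V1").eval()
vgg.classifier = vgg.classifier[:-1]         # drop the 1000-way FC layer

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)          # a preprocessed image (hypothetical)
    feat = vgg(x)                            # 4096-D penultimate activations
    feat_flip = vgg(torch.flip(x, dims=[3])) # horizontally flipped image
    descriptor = (feat + feat_flip) / 2      # averaged descriptor
```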
132 | horizontally (1) | [ˌhɒrɪ'zɒntəlɪ] |
- The descriptor is then averaged with the descriptor of a horizontally flipped image.然后将描述符与水平翻转图像的描述符进行平均。
|
133 | optimally (1) | ['əptəməli] |
- Stacking allows a subsequent classifier to learn how to optimally combine image statistics over a range of scales; this, however, comes at the cost of the increased descriptor dimensionality.堆叠允许后续分类器学习如何在一系列尺度上最佳地组合图像统计信息;然而,这是以增加描述符维数为代价的。
|
134 | inflate (1) | [ɪnˈfleɪt] |
- Since averaging has a benefit of not inflating the descriptor dimensionality, we were able to aggregate image descriptors over a wide range of scales: $Q \in \{256, 384, 512, 640, 768\}$.由于平均具有不膨胀描述符维度的优点,我们能够在很宽的尺度范围上聚合图像描述符:$Q \in \{256, 384, 512, 640, 768\}$。
|
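The trade-off above is between stacking (concatenating per-scale descriptors, inflating dimensionality) and averaging (keeping it fixed). A sketch of the averaging route, with the L2 normalisation the paper applies to descriptors (PyTorch assumed):

```python
import torch
import torch.nn.functional as F

def aggregate(per_scale_descriptors):
    # per_scale_descriptors: list of (4096,) tensors, one per test scale Q
    normed = [F.normalize(d, dim=0) for d in per_scale_descriptors]
    return torch.stack(normed).mean(dim=0)  # still 4096-D, regardless of #scales

descriptor = aggregate([torch.randn(4096) for _ in (256, 384, 512, 640, 768)])
```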
135 | pretrained (1) | |
- Our methods set the new state of the art across image representations, pretrained on the ILSVRC dataset, outperforming the previous best result of Chatfield et al. (2014) by more than 6%.我们的方法在基于ILSVRC数据集预训练的图像表示中创造了新的最佳水平,比Chatfield等人(2014)之前的最佳结果高出6%以上。
|
136 | detection-assisted (1) | [!≈ dɪˈtekʃn əˈsistid] |
- It also benefits from the fusion with an object detection-assisted classification pipeline.它还受益于与对象检测辅助分类流水线的融合。
|
137 | Griffin (1) | [ˈgrɪfɪn] |
- In this section we evaluate very deep features on Caltech-101 (Fei-Fei et al., 2004) and Caltech-256 (Griffin et al., 2007) image classification benchmarks.在本节中,我们在Caltech-101(Fei-Fei等人,2004)和Caltech-256(Griffin等人,2007)图像分类基准上评估非常深的特征。
|
138 | task-specific (1) | [!≈ tɑ:sk spəˈsɪfɪk] |
- Unlike other approaches, we did not incorporate any task-specific heuristics, but relied on the representation power of very deep convolutional features.与其他方法不同,我们没有包含任何特定于任务的启发式方法,而是依赖于非常深的卷积特征的表示能力。
|
139 | heuristic (1) | [hjuˈrɪstɪk] |
- Unlike other approaches, we did not incorporate any task-specific heuristics, but relied on the representation power of very deep convolutional features.与其他方法不同,我们没有包含任何特定于任务的启发式方法,而是依赖于非常深的卷积特征的表示能力。
|
140 | consistently (1) | [kən'sɪstəntlɪ] |
- Since the public release of our models, they have been actively used by the research community for a wide range of image recognition tasks, consistently outperforming more shallow representations.自从我们的模型公开发布以来,研究界一直在积极地使用它们来完成广泛的图像识别任务,始终优于更浅的表示。
|
141 | Girshick (1) | |
- For instance, Girshick et al. (2014) achieve state-of-the-art object detection results by replacing the ConvNet of Krizhevsky et al. (2012) with our 16-layer model.例如,Girshick等人(2014)通过用我们的16层模型替换Krizhevsky等人(2012)的ConvNet,实现了最先进的对象检测结果。
|
142 | Kiros (1) | |
- Similar gains over a more shallow architecture of Krizhevsky et al. (2012) have been observed in semantic segmentation (Long et al., 2014), image caption generation (Kiros et al., 2014; Karpathy & Fei-Fei, 2014), texture and material recognition (Cimpoi et al., 2014; Bell et al., 2014).相对于Krizhevsky等人(2012)较浅架构的类似提升,也已在语义分割(Long等人,2014)、图像描述生成(Kiros等人,2014;Karpathy & Fei-Fei,2014)以及纹理和材质识别(Cimpoi等人,2014;Bell等人,2014)中被观察到。
|
143 | Karpathy (1) | |
- Similar gains over a more shallow architecture of Krizhevsky et al. (2012) have been observed in semantic segmentation (Long et al., 2014), image caption generation (Kiros et al., 2014; Karpathy & Fei-Fei, 2014), texture and material recognition (Cimpoi et al., 2014; Bell et al., 2014).相对于Krizhevsky等人(2012)较浅架构的类似提升,也已在语义分割(Long等人,2014)、图像描述生成(Kiros等人,2014;Karpathy & Fei-Fei,2014)以及纹理和材质识别(Cimpoi等人,2014;Bell等人,2014)中被观察到。
|
144 | Cimpoi (1) | |
- Similar gains over a more shallow architecture of Krizhevsky et al. (2012) have been observed in semantic segmentation (Long et al., 2014), image caption generation (Kiros et al., 2014; Karpathy & Fei-Fei, 2014), texture and material recognition (Cimpoi et al., 2014; Bell et al., 2014).相对于Krizhevsky等人(2012)较浅架构的类似提升,也已在语义分割(Long等人,2014)、图像描述生成(Kiros等人,2014;Karpathy & Fei-Fei,2014)以及纹理和材质识别(Cimpoi等人,2014;Bell等人,2014)中被观察到。
|
145 | post-submission (1) | [!≈ pəʊst səbˈmɪʃn] |
- v2 Adds post-submission ILSVRC experiments with training set augmentation using scale jittering, which improves the performance.v2增加了提交后的ILSVRC实验,使用尺度抖动进行训练集增强,从而提高了性能。
|
146 | Camera-ready (1) | [!≈ ˈkæmərə ˈredi] |
- v6 Camera-ready ICLR-2015 conference paper.v6 ICLR-2015会议论文付印终稿。
|